Abstract

Statistical postprocessing is a method to improve the direct model output of numerical weather prediction (NWP) models. In this assignment we use linear regression models based on the observed air temperature at 2 meters (T2Obs) and the observed dew point temperature at 2 meters (D2Obs) to improve the forecast capability of the NWP model. We develop linear regression models for the air temperature at 2 meters (T2) and the dew point temperature at 2 meters (D2) and compare them with the standard models available in R, such as lm, step and glm. We then use the predicted T2 and D2 to calculate the relative humidity (RH), and compute the root mean square error (rmse) of RH for the ECMWF NWP model and for the linear models. The results show that the linear regression models reduce the rmse of RH by up to a factor of 2. The data used in the analysis are the ECMWF weather data for the Kumpula weather station, covering the four seasons and 64 forecast periods, together with the T2Obs and D2Obs observations available at the Finnish Meteorological Institute.

Brief description of the Research question

Statistical postprocessing is a method to improve the direct model output of numerical weather prediction models. It enhances the NWP model's forecast ability by relating the model output to observational or additional model data using statistical methods. It incorporates local observations into the model output to reduce the systematic biases of the model.

Model Output Statistics (MOS) is a type of statistical post-processing technique. MOS was defined by Glahn and Lowry in 1972 as the following: Model Output Statistics is an objective weather forecasting technique which consists of determining a statistical relationship between a predictand and variables forecast by a numerical model at some projection time(s). It is, in effect, the determination of the “weather related” statistics of a numerical model.

The Finnish Meteorological Institute (FMI) has started a project called POSSE that aims to improve the accuracy of weather forecasts by combining raw forecasts and local observations with statistical models. In POSSE, the methodology for statistical post-processing of weather forecasts is MOS, and it consists of determining a statistical relationship between the response variable, for example the observed temperature, and predictor variables such as mean sea level pressure, high cloud cover, etc. The vision of POSSE is to enable FMI to produce cost-efficient, reliable, and competitive forecasts in Finland and around the world.

In the POSSE project, work has already been done in improving the forecasts by incorporating the observed temperature at 2 meters with the numerical weather model obtained from ECMWF (European Centre for Medium-Range Weather Forecasts). This assignment problem is a part of my work in the POSSE project where we analyze the interdependence of variables of the forecast model.

Specifically, in this assignment we analyze the interdependence of T2, the air temperature at 2 meters, and D2, the dewpoint temperature at 2 meters.

The dewpoint temperature indicates the amount of moisture in the air. A high dewpoint indicates a higher moisture content of the air at a given temperature. The dewpoint temperature is defined as the temperature at which the air can no longer hold all the water vapour mixed with it, so that some of the water vapour must condense. The dewpoint temperature is always lower than or equal to the air temperature.

Relative humidity (RH) indicates how moist the air is. The air has a high relative humidity when the air temperature and the dewpoint temperature are very close; when there is a large difference between them, the relative humidity is low. A relative humidity of 100% indicates that the dewpoint equals the current air temperature and that the air is maximally saturated with water. When the dewpoint remains constant and the temperature increases, the relative humidity decreases.
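This relationship between air temperature, dewpoint and RH can be made quantitative. One common choice is the Magnus approximation; this is an assumption for illustration only, and the formula used later in this work for RH may differ.

```r
# Relative humidity (%) from air temperature and dewpoint, both in
# kelvin, using the Magnus approximation (one common choice; not
# necessarily the formula used in the POSSE project).
magnus_rh <- function(t2, d2) {
  a <- 17.625; b <- 243.04              # Magnus coefficients (over water, deg C)
  t_c <- t2 - 273.15; d_c <- d2 - 273.15
  100 * exp(a * d_c / (b + d_c) - a * t_c / (b + t_c))
}
magnus_rh(280, 280)   # dewpoint equals air temperature: RH = 100
magnus_rh(280, 275)   # dewpoint below air temperature: RH below 100
```

Note how RH drops quickly as the dewpoint falls below the air temperature, which is why errors in predicted T2 and D2 are amplified in the derived RH.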

We have the ECMWF model outputs for T2 and D2, and also the observed temperature and dewpoint temperature at 2 meters, T2Obs and D2Obs respectively. We develop a linear regression model with T2Obs as the response variable and the ECMWF model forecasts as predictors, and use it to predict T2. A similar linear regression with D2Obs as the response variable is used to predict D2. We then calculate the relative humidity (RH), which is a function of T2 and D2. We compare our linear regression model with the linear regression models available in R: simple linear regression (lm), linear regression with the step method, and linear regression with cross validation (glm.cv).
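As a minimal illustration of this MOS-style correction, the following sketch fits a linear regression of synthetic "observations" on a biased synthetic "raw forecast" and compares the rmse before and after; the real models use the full predictor set and the actual station data.

```r
# Minimal MOS sketch with synthetic data: regress observations on a
# biased raw NWP forecast and compare root mean square errors.
set.seed(1)
t2_obs <- rnorm(200, mean = 280, sd = 8)       # "observed" 2 m temperature (K)
t2_nwp <- t2_obs + 1.5 + rnorm(200, 0, 1)      # raw forecast with a +1.5 K bias

fit <- lm(t2_obs ~ t2_nwp)                     # the MOS correction

rmse <- function(x, y) sqrt(mean((x - y)^2))
rmse(t2_obs, t2_nwp)        # raw forecast error (includes the bias)
rmse(t2_obs, fitted(fit))   # postprocessed error is smaller
```

The regression removes the systematic bias, so the postprocessed rmse is smaller than the raw-forecast rmse; this is the effect the assignment measures on the real ECMWF data.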

The results of this work are the following: 1. Our linear regression model is comparable with the lm, lmstep and glm models. 2. Errors in predicted variables are amplified in quantities derived from them; for example, errors in the predicted air temperature at 2 meters (T2) and dewpoint temperature at 2 meters (D2) are amplified in the relative humidity. 3. The linear regression models reduce the rmse of RH by up to a factor of 2. 4. Statistical postprocessing with the MOS methodology, using linear regression models with the observed air temperature at 2 meters (T2Obs) and the dewpoint temperature at 2 meters (D2Obs), improves the forecast model.

My Contributions

The work on the interdependence of variables was assigned to me by Jussi Ylhäisi (jussi.ylhaisi@fmi.fi) in January 2017. A linear regression model for T2 using T2Obs and the ECMWF raw model data already existed at FMI. I developed the linear regression model for D2 with D2Obs and the ECMWF raw model data, and calculated the RH based on T2 and D2.

Discussions with Jussi Ylhäisi and other POSSE project members helped me to study this problem in depth. I have also been influenced by the R coding style used at FMI and by the Google R style guide.

The work done in this assignment, developing and analyzing the linear regression model, namely mymodel, is fully my own. I developed the linear regression models for T2 and D2 and calculated the RH for the Kumpula weather station. Jussi Ylhäisi provided the original csv file “station2998_all_data_with_timestamps.csv”. I wrote create_weather_data.R to produce the clean data (without NAs and constant variables), which is saved in the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”.


Weather data

The weather data is time series data, i.e., data recorded at discrete time intervals. As an example of a time series, we plot the observed 2 meter temperature, T2Obs, in kelvin for the years 2011 to 2015. As expected, the observed temperature shows a seasonal pattern, varying from high temperatures in summer to low temperatures in winter, and also a diurnal pattern varying with the hours of the day.

library(ggplot2)
timeseries <-  read.table(file = "data/timeseries_example.csv", sep = ",", header = TRUE)

timeseries$timestamp <- as.Date(strptime(timeseries$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(timeseries$timestamp, timeseries$T2Obs, xlab = "year", ylab = "Observed temperature (Kelvin)")

The description of the weather data

The weather data we use in this assignment is based on ECMWF’s 10-day forecasts for the Kumpula station (station id 2998). There are 5 station and time related variables (station_id, season, analysis_time, forecast_period and timestamp) and 25 predictor variables which describe, for example, temperature, dew point and relative humidity forecasts. We also have two response variables: the observed two-meter temperature, denoted T2Obs, and the observed two-meter dew point temperature, denoted D2Obs. The ECMWF 10-day forecast includes forecasts for T2 and D2. The weather data covers the period from 2011-12-01 to 2015-10-01.

T2Obs and D2Obs are available from the [FMI open data portal](https://en.ilmatieteenlaitos.fi/open-data). We use the model data from the Atmospheric Model high resolution 10-day forecast (HRES) from ECMWF.

The variables of the raw forecast model from ECMWF (for the Single level -forecast) are defined here. The definitions of the predictor variables we use are given below.

No. Abbreviation Description
1 station_id ECMWF id for the stations (Check)
2 season Winter, Spring, Summer and Autumn( 1-4) (Check)
3 analysis_time (Check) 00UTC and 12UTC (Coordinated Universal Time)
4 forecast_period Forecast period. (Check)
5 timestamp Timestamps 2011-12-06 12:00:00 to 2015-10-01 12:00:00
6 T2_Obs Observed temperature at 2 meters
7 MSL Mean sea level pressure
8 T2 2 meter temperature (ECMWF:2T)
9 D2 2 meter dewpoint temperature (ECMWF: 2D)
10 U10 10 meter U-velocity (ECMWF: 10U)
11 V10 10 meter V-velocity (ECMWF: 10V)
12 TP Total precipitation
13 LCC Low cloud cover
14 MCC Medium cloud cover
15 HCC High cloud cover
16 SKT Skin temperature (temperature on the surface of the Earth)
17 FG10_3 10 meter wind gust in the last 3 hours (ECMWF: 10FG3)
18 MX2T3 Maximum temperature at 2 meters in the last 3 hours
19 RH_950 Relative humidity at 950hPa
20 T_950 Temperature at 950hPa
21 RH_925 Relative humidity at 925hPa
22 T_925 Temperature at 925hPa
23 RH_850 Relative humidity at 850hPa
24 T_850 Temperature at 850hPa
25 RH_700 Relative humidity at 700hPa
26 T_700 Temperature at 700hPa
27 RH_500 Relative humidity at 500hPa
28 T_500 Temperature at 500hPa
29 T2_M1 2 meter temperature at previous forecast period
30 T_950_M1 Temperature at 950hPa at previous forecast period
31 T_925_M1 Temperature at 925hPa at previous forecast period
32 DECLINATION Declination
33 D2_M1 2 meter dew point temperature at previous forecast period
34 D2_Obs Observed dew point temperature at 2 meters

Loading the data

The original data file is “station2998_all_data_with_timestamps.csv”. It has 299520 observations of 34 variables. Unfortunately, since the size of this file exceeds GitHub’s recommended maximum file size, I cannot upload it to my GitHub directory. The file has missing values, represented by “NA”. We remove the variables with a large number of NAs (no. of NAs > no. of observations / 4); on this basis two variables, “FG10_3” (10 meter wind gust in the last 3 hours) and “MX2T3” (maximum temperature at 2 meters in the last 3 hours), are removed. We also check for variables with constant values; only station_id is constant, and it is retained. Finally, we remove the rows with NAs in the remaining 32 variables. The program create_weather_data.R performs these steps and saves the “cleaned” data in the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”.
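The cleaning steps just described can be sketched on a toy data frame as follows; this is an illustration of the logic, not a reproduction of create_weather_data.R, and the NA-fraction threshold and the special-casing of station_id are taken from the description above.

```r
# Sketch of the cleaning logic: drop columns with > 25% NAs or with a
# constant value (except columns listed in `keep`), then drop rows
# that still contain NAs.
clean_weather <- function(df, keep = "station_id") {
  too_many_na <- sapply(df, function(x) sum(is.na(x)) > nrow(df) / 4)
  constant    <- sapply(df, function(x) length(unique(x[!is.na(x)])) <= 1)
  drop <- (too_many_na | constant) & !(names(df) %in% keep)
  na.omit(df[, !drop, drop = FALSE])
}

toy <- data.frame(station_id = 2998,
                  T2     = c(279, NA, 276, 277),
                  FG10_3 = c(NA, NA, NA, 1),    # mostly NA: column dropped
                  flat   = 5)                   # constant: column dropped
clean_weather(toy)   # keeps station_id and T2, minus the row with NA in T2
```

On the toy frame this keeps only station_id and T2 and removes the single row where T2 is NA, mirroring how the real script reduces 34 variables to 32 and 299520 rows to 290982.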

The “cleaned” weather data, T2_D2_Obs_model_data_with_timestamps, is loaded from the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”. The weather data is multidimensional: it has 290982 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp. We choose the Kumpula station, which has station id 2998. We have data for all four seasons, namely winter, spring, summer and autumn, represented by the values 1, 2, 3 and 4 respectively.

Since we are using 10-day forecasts, the forecast period takes values from 2 to 65, covering lead times from +0 to +240 hours. Between +0 and +144 the step size is 3 hours, and between +144 and +240 the step size is 6 hours. When the forecast period has the value 2 we get a 3-hour forecast, and when its value is 9 we get a 24-hour (one day) forecast. The forecast period with value 65 represents a 10-day forecast. The forecasts are created twice a day, at 00:00 UTC (Coordinated Universal Time) and 12:00 UTC; the variable analysis_time has the two values 1 and 2, corresponding to 00:00 UTC and 12:00 UTC.
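Under the indexing just described (value 2 = +3 h, 3-hour steps up to +144 h at value 49, then 6-hour steps to +240 h at value 65; the exact index origin is my assumption), the lead time in hours can be recovered from the forecast period value:

```r
# Map a forecast_period value to the forecast lead time in hours,
# assuming value p gives +3*(p-1) h up to +144 h (p = 49) and
# 6-hour steps thereafter up to +240 h (p = 65).
lead_hours <- function(p) {
  ifelse(p <= 49, 3 * (p - 1), 144 + 6 * (p - 49))
}
lead_hours(c(2, 9, 49, 65))  # 3, 24, 144, 240 hours
```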

The response variable T2Obs is in column 6 and the other response variable, D2Obs, is the last column of the weather data. Columns 8 and 9 give the ECMWF model output for T2 and D2. From the ECMWF model we also have the temperature and relative humidity at various pressure levels: 950 hPa (hectopascal), 925 hPa, 850 hPa, 700 hPa and 500 hPa. We also have the values of T2, T_950, T_925 and D2 at the previous forecast period, named T2_M1, T_950_M1, T_925_M1 and D2_M1 respectively.

The ECMWF model also provides the temperature at the surface of the Earth, called the skin temperature (SKT), and the wind velocity components at 10 meters (U10 and V10). Declination data is also available; declination is defined as the angular distance north or south of the celestial equator, measured along a great circle passing through the celestial pole.
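The DECLINATION column presumably contains the solar declination in degrees; a standard cosine approximation (not necessarily how the column was generated) reproduces its range of roughly -23.44 to +23.44 degrees seen in the summary below.

```r
# Approximate solar declination (degrees) for day-of-year `doy`,
# using the simple cosine formula; the data column may come from a
# more precise ephemeris.
declination <- function(doy) {
  -23.44 * cos(2 * pi / 365 * (doy + 10))
}
declination(c(1, 172, 355))  # near the solstices: about -23 and +23 degrees
```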

We load the weather data, T2_D2_Obs_model_data_with_timestamps, and examine its structure below.

# loading the data from the file 
T2_D2_Obs_model_data_with_timestamps = read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)
dim(T2_D2_Obs_model_data_with_timestamps)
## [1] 290982     32
colnames(T2_D2_Obs_model_data_with_timestamps)
##  [1] "station_id"      "season"          "analysis_time"  
##  [4] "forecast_period" "timestamp"       "T2Obs"          
##  [7] "MSL"             "T2"              "D2"             
## [10] "U10"             "V10"             "TP"             
## [13] "LCC"             "MCC"             "HCC"            
## [16] "SKT"             "RH_950"          "T_950"          
## [19] "RH_925"          "T_925"           "RH_850"         
## [22] "T_850"           "RH_700"          "T_700"          
## [25] "RH_500"          "T_500"           "T2_M1"          
## [28] "T_950_M1"        "T_925_M1"        "DECLINATION"    
## [31] "D2_M1"           "D2Obs"
str(T2_D2_Obs_model_data_with_timestamps)
## 'data.frame':    290982 obs. of  32 variables:
##  $ station_id     : int  2998 2998 2998 2998 2998 2998 2998 2998 2998 2998 ...
##  $ season         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ analysis_time  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ forecast_period: int  5 5 5 5 5 5 5 5 5 5 ...
##  $ timestamp      : Factor w/ 11184 levels "2011-12-01 03:00:00",..: 4 12 20 28 36 44 52 60 68 76 ...
##  $ T2Obs          : num  279 280 276 278 276 ...
##  $ MSL            : num  101258 99586 99785 97798 97875 ...
##  $ T2             : num  279 279 276 277 276 ...
##  $ D2             : num  277 277 272 276 274 ...
##  $ U10            : num  4.8 5 4.2 8.4 3.9 5.4 1.4 1.6 -5.2 5.5 ...
##  $ V10            : num  5.4 4.3 3 7.3 5.4 0.5 5.9 7.2 11.5 6.9 ...
##  $ TP             : num  0 0.03 0 0.27 0.23 0 0.03 0.17 0.2 0.43 ...
##  $ LCC            : num  0.42 0.29 0.01 0.58 0.24 0 0.56 0.58 0.53 0.85 ...
##  $ MCC            : num  0 1 0 0.27 0.18 0.01 0.04 0.75 1 0.96 ...
##  $ HCC            : num  1 1 0 0.91 0.78 0.94 0.74 0.91 1 1 ...
##  $ SKT            : num  279 278 276 278 276 ...
##  $ RH_950         : num  94.5 87.2 67.8 94.9 97 81.6 77.2 95.2 88.7 85.9 ...
##  $ T_950          : num  275 276 274 276 274 ...
##  $ RH_925         : num  52.2 81.6 74.7 97.4 93.4 79.7 83.2 84.3 92.1 94.9 ...
##  $ T_925          : num  275 275 272 274 272 ...
##  $ RH_850         : num  7.9 63.9 23 84.3 89.2 ...
##  $ T_850          : num  273 270 268 270 268 ...
##  $ RH_700         : num  16.5 70.1 17.9 50.6 62.5 ...
##  $ T_700          : num  264 260 257 260 258 ...
##  $ RH_500         : num  53.9 100.3 22.3 69.8 59.3 ...
##  $ T_500          : num  246 248 246 240 237 ...
##  $ T2_M1          : num  278 279 275 278 276 ...
##  $ T_950_M1       : num  276 276 273 276 274 ...
##  $ T_925_M1       : num  276 275 272 274 273 ...
##  $ DECLINATION    : num  -13.6 -13.9 -14.3 -14.6 -14.9 ...
##  $ D2_M1          : num  276 277 272 277 274 ...
##  $ D2Obs          : num  278 278 271 275 274 ...

Summary of the weather data set

The minimum, maximum, mean and quantile values of the weather data for columns 6:32 are shown below. T2Obs has a mean value of 280.0 K and varies from a minimum of 247.7 K to a maximum of 303.6 K. D2Obs has a mean value of 276.0 K and varies from a minimum of 244.8 K to a maximum of 293.1 K.

# summary of the weather dataset
summary(T2_D2_Obs_model_data_with_timestamps[,-c(1:5)])
##      T2Obs            MSL               T2              D2       
##  Min.   :247.7   Min.   : 94622   Min.   :246.7   Min.   :244.2  
##  1st Qu.:274.1   1st Qu.:100589   1st Qu.:273.3   1st Qu.:270.6  
##  Median :279.6   Median :101300   Median :278.9   Median :275.7  
##  Mean   :280.0   Mean   :101280   Mean   :279.3   Mean   :275.9  
##  3rd Qu.:287.1   3rd Qu.:101984   3rd Qu.:286.6   3rd Qu.:282.8  
##  Max.   :303.6   Max.   :105590   Max.   :300.7   Max.   :293.9  
##       U10                V10                TP                LCC        
##  Min.   :-15.2000   Min.   :-14.100   Min.   :-0.03000   Min.   :0.0000  
##  1st Qu.: -1.8000   1st Qu.: -1.400   1st Qu.: 0.00000   1st Qu.:0.0100  
##  Median :  1.3000   Median :  1.100   Median : 0.00000   Median :0.2700  
##  Mean   :  0.8083   Mean   :  1.032   Mean   : 0.08565   Mean   :0.4265  
##  3rd Qu.:  3.5000   3rd Qu.:  3.300   3rd Qu.: 0.05000   3rd Qu.:0.9300  
##  Max.   : 13.9000   Max.   : 16.500   Max.   : 5.75000   Max.   :1.0000  
##       MCC              HCC              SKT            RH_950      
##  Min.   :0.0000   Min.   :0.0000   Min.   :248.0   Min.   :  1.10  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:272.9   1st Qu.: 61.30  
##  Median :0.0900   Median :0.1000   Median :278.8   Median : 79.50  
##  Mean   :0.3202   Mean   :0.3809   Mean   :279.3   Mean   : 76.16  
##  3rd Qu.:0.6800   3rd Qu.:0.9200   3rd Qu.:286.6   3rd Qu.: 94.60  
##  Max.   :1.0000   Max.   :1.0000   Max.   :302.7   Max.   :109.10  
##      T_950           RH_925           T_925           RH_850      
##  Min.   :244.2   Min.   :  0.90   Min.   :243.8   Min.   :  0.40  
##  1st Qu.:271.3   1st Qu.: 60.70   1st Qu.:270.4   1st Qu.: 50.50  
##  Median :277.0   Median : 78.10   Median :276.0   Median : 72.90  
##  Mean   :277.3   Mean   : 74.92   Mean   :276.2   Mean   : 67.24  
##  3rd Qu.:284.3   3rd Qu.: 93.80   3rd Qu.:282.9   3rd Qu.: 88.10  
##  Max.   :299.0   Max.   :114.50   Max.   :297.2   Max.   :120.20  
##      T_850           RH_700           T_700           RH_500      
##  Min.   :243.1   Min.   :  0.50   Min.   :238.3   Min.   :  0.90  
##  1st Qu.:267.6   1st Qu.: 24.70   1st Qu.:259.6   1st Qu.: 26.20  
##  Median :273.1   Median : 50.20   Median :265.7   Median : 50.40  
##  Mean   :272.8   Mean   : 51.65   Mean   :265.0   Mean   : 53.92  
##  3rd Qu.:278.7   3rd Qu.: 78.00   3rd Qu.:270.7   3rd Qu.: 82.90  
##  Max.   :292.5   Max.   :117.80   Max.   :281.8   Max.   :126.10  
##      T_500           T2_M1          T_950_M1        T_925_M1    
##  Min.   :225.6   Min.   :246.7   Min.   :244.2   Min.   :243.8  
##  1st Qu.:243.7   1st Qu.:273.3   1st Qu.:271.3   1st Qu.:270.4  
##  Median :249.8   Median :278.9   Median :277.0   Median :276.0  
##  Mean   :249.0   Mean   :279.3   Mean   :277.3   Mean   :276.2  
##  3rd Qu.:254.8   3rd Qu.:286.6   3rd Qu.:284.3   3rd Qu.:282.9  
##  Max.   :265.3   Max.   :302.2   Max.   :299.0   Max.   :297.2  
##   DECLINATION           D2_M1           D2Obs      
##  Min.   :-23.4401   Min.   :244.2   Min.   :244.8  
##  1st Qu.:-16.6123   1st Qu.:270.7   1st Qu.:271.1  
##  Median :  0.7669   Median :275.7   Median :275.8  
##  Mean   :  0.2004   Mean   :275.9   Mean   :276.0  
##  3rd Qu.: 16.8441   3rd Qu.:282.8   3rd Qu.:282.8  
##  Max.   : 23.4401   Max.   :293.9   Max.   :293.1

Visualizing the data

Here we plot histograms of the observed temperature at 2 m, the ECMWF model temperature at 2 m and the temperature at 2 m at the previous forecast period, and likewise the observed dewpoint temperature at 2 m, the ECMWF model dewpoint temperature at 2 m and the dewpoint temperature at 2 m at the previous forecast period. Unfortunately I was not able to reduce the font size of the titles, so the full titles are not shown in the figures.

library(ggplot2)
library(gridExtra)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2Obs,  geom= "histogram", binwidth = 2,  xlab = "T2Obs", 
            fill = I("red3"), col = I("gray"), main="Observed temp at 2 m")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2,  geom= "histogram", binwidth = 2,  xlab = "T2", 
            fill = I("tomato"), col = I("gray"), main = "Temp at 2 m - ECMWF model") 

p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2_M1,  geom= "histogram", binwidth = 2,  xlab = "T2_M1", 
            fill = I("hotpink"), col = I("gray"), main = "Temp at 2 meters - previous forecast period") 

p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2Obs,  geom= "histogram", binwidth = 2,  xlab = "D2Obs", 
            fill = I("blue"), col = I("gray"), main = "Observed Dewpoint temp at 2 m") 

p5 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2,  geom= "histogram", binwidth = 2,  xlab = "D2", 
            fill = I("dodgerblue"), col = I("gray"), main = "Dewpoint temp at 2 m - ECMWF model") 

p6 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2_M1,  geom= "histogram", binwidth = 2,  xlab = "D2_M1", 
            fill = I("deepskyblue"), col = I("gray"), main = "Dewpoint temp at 2 m - previous forecast period") 

grid.arrange(p1, p2, p3, p4,p5,p6, ncol = 3, nrow = 2)
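One way to keep long panel titles from being clipped is to shrink the title font with ggplot2's theme(); a small sketch, not applied to the figures above:

```r
library(ggplot2)

# Example histogram with a long title, as in the panels above
p <- qplot(rnorm(100), geom = "histogram", binwidth = 0.5,
           main = "Dewpoint temp at 2 m - previous forecast period")

# Shrink the title font so it fits inside a small grid panel
p + theme(plot.title = element_text(size = 8))
```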

p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$U10,  geom= "histogram", binwidth = 2,  xlab = "U10", 
            fill = I("yellow"), col = I("gray"), main="10 meter U-velocity")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$V10,  geom= "histogram", binwidth = 2,  xlab = "V10", 
            fill = I("yellowgreen"), col = I("gray"), main = "10 meter V-velocity") 


grid.arrange(p1, p2, ncol = 2, nrow = 1)

# TP and the cloud cover fractions span small ranges (TP up to ~5.75,
# cloud cover 0 to 1), so a binwidth of 2 would collapse each histogram
# into a single bar; use finer binwidths instead.
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$TP,  geom= "histogram", binwidth = 0.1,  xlab = "TP", 
            fill = I("bisque"), col = I("white"), main="Total precipitation")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$LCC,  geom= "histogram", binwidth = 0.05,  xlab = "LCC", 
            fill = I("gray60"), col = I("white"), main = "Low cloud cover") 

p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$MCC,  geom= "histogram", binwidth = 0.05,  xlab = "MCC", 
            fill = I("gray45"), col = I("white"), main = "Medium cloud cover") 

p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$HCC,  geom= "histogram", binwidth = 0.05,  xlab = "HCC", 
            fill = I("black"), col = I("white"), main = "High cloud cover") 


grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_950,  geom= "histogram", binwidth = 2,  xlab = "RH_950", 
            fill = I("forestgreen"), col = I("gray"), main="Relative humidity at 950 hPa")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_925,  geom= "histogram", binwidth = 2,  xlab = "RH_925", 
            fill = I("chartreuse"), col = I("gray"), main="Relative humidity at 925 hPa")

p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_950,  geom= "histogram", binwidth = 2,  xlab = "T_950", 
            fill = I("blueviolet"), col = I("gray"), main = "Temperature at 950 hPa") 


p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_925,  geom= "histogram", binwidth = 2,  xlab = "T_925", 
            fill = I("orchid2"), col = I("gray"), main = "Temperature at 925 hPa") 


grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_850,  geom= "histogram", binwidth = 2,  xlab = "RH_850", 
            fill = I("brown4"), col = I("gray"), main="Relative humidity at 850 hPa")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_700,  geom= "histogram", binwidth = 2,  xlab = "RH_700", 
            fill = I("brown1"), col = I("gray"), main="Relative humidity at 700 hPa")

p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_500,  geom= "histogram", binwidth = 2,  xlab = "RH_500", 
            fill = I("darksalmon"), col = I("gray"), main="Relative humidity at 500 hPa")

p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_850,  geom= "histogram", binwidth = 2,  xlab = "T_850", 
            fill = I("hotpink4"), col = I("gray"), main = "Temperature at 850 hPa") 
p5 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_700,  geom= "histogram", binwidth = 2,  xlab = "T_700", 
            fill = I("hotpink1"), col = I("gray"), main = "Temperature at 700 hPa") 
p6 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_500,  geom= "histogram", binwidth = 2,  xlab = "T_500", 
            fill = I("thistle1"), col = I("gray"), main = "Temperature at 500 hPa") 


grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)

p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_950_M1,  geom= "histogram", binwidth = 2,  xlab = "T_950_M1", 
            fill = I("purple"), col = I("gray"), main="Temperature at 950 hPa at previous forecast period")

p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_925_M1,  geom= "histogram", binwidth = 2,  xlab = "T_925_M1", 
            fill = I("magenta"), col = I("gray"), main="Temperature at 925 hPa at previous forecast period")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$SKT,  geom= "histogram", binwidth = 2,  xlab = "SKT", 
            fill = I("limegreen"), col = I("gray"), main="Skin temperature")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$DECLINATION,  geom= "histogram", binwidth = 2,  xlab = "DECLINATION", 
            fill = I("turquoise"), col = I("gray"), main = "Declination") 

grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

Correlation matrix and Scatter plots

As our aim is to develop linear regression models for T2 and D2, it is useful to study the correlations between the variables of the weather data. Below we see the correlation matrix of the weather data. We describe the correlations in more detail in the next section on the linear regression models.

cor(as.matrix(T2_D2_Obs_model_data_with_timestamps[,6:ncol(T2_D2_Obs_model_data_with_timestamps)])) 
##                    T2Obs         MSL          T2           D2
## T2Obs        1.000000000 -0.05572234  0.95677047  0.905251875
## MSL         -0.055722339  1.00000000 -0.08791121 -0.156226839
## T2           0.956770473 -0.08791121  1.00000000  0.950569781
## D2           0.905251875 -0.15622684  0.95056978  1.000000000
## U10          0.067249551 -0.25160619  0.08568880  0.092750632
## V10          0.083376989 -0.14451363  0.11372004  0.183780138
## TP           0.034104010 -0.28421222  0.05729668  0.143197694
## LCC         -0.383691554 -0.15559078 -0.34719821 -0.219142567
## MCC         -0.088984980 -0.31262176 -0.05359143 -0.009174696
## HCC          0.006123515 -0.18998591  0.02740180  0.083962326
## SKT          0.945729355 -0.07374940  0.98761518  0.926508834
## RH_950      -0.225417792 -0.33476345 -0.17360235  0.006965672
## T_950        0.933069935 -0.08665758  0.95428786  0.937783868
## RH_925      -0.102333330 -0.38616568 -0.04864560  0.096968298
## T_925        0.923163399 -0.07201997  0.94304485  0.936505625
## RH_850       0.126840672 -0.41174015  0.16883134  0.241192270
## T_850        0.882236637 -0.01406959  0.90345704  0.919827444
## RH_700      -0.051455549 -0.35600257 -0.02325322  0.053286552
## T_700        0.839162854  0.09190267  0.86100183  0.870233283
## RH_500      -0.161975033 -0.25026527 -0.14241710 -0.089229234
## T_500        0.806916932  0.13931168  0.82741602  0.822158694
## T2_M1        0.939975772 -0.09340257  0.97775655  0.942577263
## T_950_M1     0.925904241 -0.09404718  0.94656890  0.937203206
## T_925_M1     0.917905055 -0.08163763  0.93769669  0.936316586
## DECLINATION  0.870033713  0.02963431  0.89158550  0.831878801
## D2_M1        0.902328990 -0.16428168  0.94392029  0.986912795
## D2Obs        0.906747887 -0.15062462  0.89559842  0.932057501
##                       U10         V10          TP        LCC          MCC
## T2Obs        0.0672495510  0.08337699  0.03410401 -0.3836916 -0.088984980
## MSL         -0.2516061855 -0.14451363 -0.28421222 -0.1555908 -0.312621756
## T2           0.0856887975  0.11372004  0.05729668 -0.3471982 -0.053591427
## D2           0.0927506318  0.18378014  0.14319769 -0.2191426 -0.009174696
## U10          1.0000000000  0.17638564 -0.07357771 -0.1202628 -0.121005183
## V10          0.1763856378  1.00000000  0.15411112  0.1900064  0.183867058
## TP          -0.0735777117  0.15411112  1.00000000  0.2802422  0.371971817
## LCC         -0.1202627949  0.19000640  0.28024220  1.0000000  0.293703773
## MCC         -0.1210051826  0.18386706  0.37197182  0.2937038  1.000000000
## HCC         -0.1048070197  0.23374528  0.20682647  0.1283931  0.421509022
## SKT          0.0552707731  0.10184030  0.05680998 -0.3175593 -0.047112191
## RH_950       0.0007486148  0.25620542  0.30807049  0.7102336  0.283402843
## T_950        0.0924834782  0.07698242  0.04149562 -0.4237268 -0.094934276
## RH_925      -0.0271169977  0.21620653  0.31443274  0.6508228  0.316837078
## T_925        0.0898313705  0.09243508  0.04733645 -0.4089855 -0.098312864
## RH_850      -0.0630461564  0.10901970  0.32251009  0.3771230  0.438559607
## T_850        0.0555016718  0.15054153  0.06406793 -0.3152063 -0.114083405
## RH_700      -0.0702727463  0.10852325  0.36381725  0.2768023  0.589872232
## T_700       -0.0231638329  0.16060536  0.05516177 -0.2563162 -0.115870251
## RH_500      -0.0483983706  0.14527154  0.21917937  0.1825503  0.581506818
## T_500       -0.0401444308  0.13379521  0.04069243 -0.2832337 -0.088998560
## T2_M1        0.0892890825  0.08541580  0.05910432 -0.3628751 -0.065680899
## T_950_M1     0.0940793151  0.05586961  0.04401064 -0.4188925 -0.100053204
## T_925_M1     0.0939991617  0.07144738  0.04845132 -0.4035580 -0.104133257
## DECLINATION -0.0065621144 -0.08136880  0.02631006 -0.4433739 -0.101191179
## D2_M1        0.1011994993  0.14830507  0.12800187 -0.2345538 -0.031400001
## D2Obs        0.0977762603  0.15584330  0.10274938 -0.2355420 -0.030938198
##                      HCC         SKT        RH_950       T_950      RH_925
## T2Obs        0.006123515  0.94572935 -0.2254177916  0.93306994 -0.10233333
## MSL         -0.189985910 -0.07374940 -0.3347634473 -0.08665758 -0.38616568
## T2           0.027401800  0.98761518 -0.1736023523  0.95428786 -0.04864560
## D2           0.083962326  0.92650883  0.0069656723  0.93778387  0.09696830
## U10         -0.104807020  0.05527077  0.0007486148  0.09248348 -0.02711700
## V10          0.233745277  0.10184030  0.2562054249  0.07698242  0.21620653
## TP           0.206826472  0.05680998  0.3080704853  0.04149562  0.31443274
## LCC          0.128393126 -0.31755930  0.7102335630 -0.42372680  0.65082283
## MCC          0.421509022 -0.04711219  0.2834028426 -0.09493428  0.31683708
## HCC          1.000000000  0.02115931  0.1743092262  0.02501490  0.16744109
## SKT          0.021159308  1.00000000 -0.1557906797  0.92156630 -0.03598029
## RH_950       0.174309226 -0.15579068  1.0000000000 -0.28194771  0.89481233
## T_950        0.025014901  0.92156630 -0.2819477092  1.00000000 -0.14805078
## RH_925       0.167441089 -0.03598029  0.8948123298 -0.14805078  1.00000000
## T_925        0.035176482  0.91046667 -0.2671040922  0.99317960 -0.16967106
## RH_850       0.145120710  0.17146633  0.4416321765  0.13018712  0.60041071
## T_850        0.067579127  0.87433721 -0.1586977252  0.93948008 -0.09611277
## RH_700       0.245108324 -0.01781787  0.3563350800 -0.04073663  0.39761722
## T_700        0.082134768  0.83759521 -0.1457914095  0.87963528 -0.07531005
## RH_500       0.489669741 -0.14107862  0.2185775948 -0.15236469  0.21102203
## T_500        0.107825895  0.80605285 -0.1856005460  0.84291866 -0.11415980
## T2_M1        0.020391763  0.95404922 -0.1825229238  0.95292850 -0.05854807
## T_950_M1     0.021741645  0.91240215 -0.2661152157  0.99197565 -0.14178616
## T_925_M1     0.030050261  0.90396482 -0.2522525073  0.98672831 -0.15453719
## DECLINATION -0.041371530  0.88899220 -0.2920096878  0.87992845 -0.16398576
## D2_M1        0.066000071  0.91985131 -0.0100536107  0.93337723  0.08077807
## D2Obs        0.063453247  0.87373781 -0.0192977938  0.88936475  0.07055859
##                   T_925      RH_850       T_850      RH_700       T_700
## T2Obs        0.92316340  0.12684067  0.88223664 -0.05145555  0.83916285
## MSL         -0.07201997 -0.41174015 -0.01406959 -0.35600257  0.09190267
## T2           0.94304485  0.16883134  0.90345704 -0.02325322  0.86100183
## D2           0.93650563  0.24119227  0.91982744  0.05328655  0.87023328
## U10          0.08983137 -0.06304616  0.05550167 -0.07027275 -0.02316383
## V10          0.09243508  0.10901970  0.15054153  0.10852325  0.16060536
## TP           0.04733645  0.32251009  0.06406793  0.36381725  0.05516177
## LCC         -0.40898549  0.37712300 -0.31520630  0.27680226 -0.25631621
## MCC         -0.09831286  0.43855961 -0.11408341  0.58987223 -0.11587025
## HCC          0.03517648  0.14512071  0.06757913  0.24510832  0.08213477
## SKT          0.91046667  0.17146633  0.87433721 -0.01781787  0.83759521
## RH_950      -0.26710409  0.44163218 -0.15869773  0.35633508 -0.14579141
## T_950        0.99317960  0.13018712  0.93948008 -0.04073663  0.87963528
## RH_925      -0.16967106  0.60041071 -0.09611277  0.39761722 -0.07531005
## T_925        1.00000000  0.11250659  0.95761159 -0.04121105  0.89607555
## RH_850       0.11250659  1.00000000  0.02791087  0.53046817  0.04289074
## T_850        0.95761159  0.02791087  1.00000000 -0.04726922  0.94916513
## RH_700      -0.04121105  0.53046817 -0.04726922  1.00000000 -0.11481186
## T_700        0.89607555  0.04289074  0.94916513 -0.11481186  1.00000000
## RH_500      -0.15014955  0.20837072 -0.14577627  0.43980885 -0.15623849
## T_500        0.85666579  0.02188350  0.89948416 -0.10318307  0.95570459
## T2_M1        0.94181997  0.16072684  0.90111712 -0.02887439  0.85620271
## T_950_M1     0.98675557  0.12494343  0.93595178 -0.04081655  0.87451829
## T_925_M1     0.99188630  0.10740663  0.95345013 -0.04284956  0.89073414
## DECLINATION  0.86593866  0.11576323  0.80617637 -0.05128072  0.77495001
## D2_M1        0.93119075  0.22321276  0.91292205  0.03326261  0.86085167
## D2Obs        0.88767322  0.20937636  0.87188165  0.02998447  0.82491440
##                  RH_500       T_500       T2_M1    T_950_M1    T_925_M1
## T2Obs       -0.16197503  0.80691693  0.93997577  0.92590424  0.91790505
## MSL         -0.25026527  0.13931168 -0.09340257 -0.09404718 -0.08163763
## T2          -0.14241710  0.82741602  0.97775655  0.94656890  0.93769669
## D2          -0.08922923  0.82215869  0.94257726  0.93720321  0.93631659
## U10         -0.04839837 -0.04014443  0.08928908  0.09407932  0.09399916
## V10          0.14527154  0.13379521  0.08541580  0.05586961  0.07144738
## TP           0.21917937  0.04069243  0.05910432  0.04401064  0.04845132
## LCC          0.18255027 -0.28323374 -0.36287509 -0.41889254 -0.40355801
## MCC          0.58150682 -0.08899856 -0.06568090 -0.10005320 -0.10413326
## HCC          0.48966974  0.10782590  0.02039176  0.02174164  0.03005026
## SKT         -0.14107862  0.80605285  0.95404922  0.91240215  0.90396482
## RH_950       0.21857759 -0.18560055 -0.18252292 -0.26611522 -0.25225251
## T_950       -0.15236469  0.84291866  0.95292850  0.99197565  0.98672831
## RH_925       0.21102203 -0.11415980 -0.05854807 -0.14178616 -0.15453719
## T_925       -0.15014955  0.85666579  0.94181997  0.98675557  0.99188630
## RH_850       0.20837072  0.02188350  0.16072684  0.12494343  0.10740663
## T_850       -0.14577627  0.89948416  0.90111712  0.93595178  0.95345013
## RH_700       0.43980885 -0.10318307 -0.02887439 -0.04081655 -0.04284956
## T_700       -0.15623849  0.95570459  0.85620271  0.87451829  0.89073414
## RH_500       1.00000000 -0.16433429 -0.14865150 -0.15337225 -0.15279333
## T_500       -0.16433429  1.00000000  0.82191579  0.83701722  0.85025947
## T2_M1       -0.14865150  0.82191579  1.00000000  0.95418324  0.94296268
## T_950_M1    -0.15337225  0.83701722  0.95418324  1.00000000  0.99321331
## T_925_M1    -0.15279333  0.85025947  0.94296268  0.99321331  1.00000000
## DECLINATION -0.17024680  0.76708514  0.89095089  0.87972000  0.86576181
## D2_M1       -0.10612808  0.81133041  0.95047514  0.93765387  0.93637685
## D2Obs       -0.09725270  0.78030745  0.89262683  0.88997069  0.88868619
##              DECLINATION       D2_M1       D2Obs
## T2Obs        0.870033713  0.90232899  0.90674789
## MSL          0.029634306 -0.16428168 -0.15062462
## T2           0.891585496  0.94392029  0.89559842
## D2           0.831878801  0.98691280  0.93205750
## U10         -0.006562114  0.10119950  0.09777626
## V10         -0.081368797  0.14830507  0.15584330
## TP           0.026310057  0.12800187  0.10274938
## LCC         -0.443373868 -0.23455381 -0.23554203
## MCC         -0.101191179 -0.03140000 -0.03093820
## HCC         -0.041371530  0.06600007  0.06345325
## SKT          0.888992201  0.91985131  0.87373781
## RH_950      -0.292009688 -0.01005361 -0.01929779
## T_950        0.879928453  0.93337723  0.88936475
## RH_925      -0.163985756  0.08077807  0.07055859
## T_925        0.865938655  0.93119075  0.88767322
## RH_850       0.115763230  0.22321276  0.20937636
## T_850        0.806176371  0.91292205  0.87188165
## RH_700      -0.051280717  0.03326261  0.02998447
## T_700        0.774950012  0.86085167  0.82491440
## RH_500      -0.170246803 -0.10612808 -0.09725270
## T_500        0.767085144  0.81133041  0.78030745
## T2_M1        0.890950895  0.95047514  0.89262683
## T_950_M1     0.879720001  0.93765387  0.88997069
## T_925_M1     0.865761809  0.93637685  0.88868619
## DECLINATION  1.000000000  0.83118261  0.80153457
## D2_M1        0.831182610  1.00000000  0.92886882
## D2Obs        0.801534566  0.92886882  1.00000000
library(corrplot)
library(dplyr)
# compute the correlation matrix, rounded to two decimals for a readable plot
cor_matrix <- cor(T2_D2_Obs_model_data_with_timestamps[, 6:ncol(T2_D2_Obs_model_data_with_timestamps)]) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method = "number", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
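Besides visualizing the full matrix, the strongest linear predictors of the response can be listed directly from the correlation matrix. The sketch below uses a small synthetic data frame rather than the Kumpula data; the column names T2Obs, T2 and MSL only mimic the real dataset.

```r
# Synthetic illustration: rank predictors by |correlation| with the response.
set.seed(1)
df <- data.frame(T2Obs = rnorm(100))
df$T2  <- df$T2Obs + rnorm(100, sd = 0.2)  # strongly related, like the NWP forecast
df$MSL <- rnorm(100)                       # unrelated noise
cm <- cor(df)
# correlations of every predictor with T2Obs, strongest first
ranked <- sort(abs(cm["T2Obs", -1]), decreasing = TRUE)
ranked
```

On the real matrix, the same one-liner applied to cor_matrix gives a quick shortlist of candidate predictors for the regression models.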

Scatter plots and distributions

With 26 variables, a matrix of scatter plots even for the winter data alone is very difficult to read. We therefore restrict the plot to a smaller set of variables: “T2Obs”, “MSL”, “T2”, “D2”, “U10”, “V10”, “TP”, “LCC”, “MCC”, “HCC”, “SKT”, “RH_950”, “T_950”, “T2_M1”, “T_950_M1”, “DECLINATION”, “D2_M1” and “D2Obs”.

library(ggplot2)
library(GGally)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_08_small <- df_winter_00_08[, c(6, 7,8, 9,10, 11,12,13,14,15,16,17,18,27,28,30,31,32)]
ggpairs(df_winter_00_08_small, lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Matrix of scatter plots and distributions of weather data")

T2 : Air temperature at 2 meters

Below we plot T2Obs against the observation index. Beyond a clear seasonal cycle in the observed 2-meter temperature, little can be inferred from this plot, so we split the data into smaller chunks by season and forecast period.

plot(T2_D2_Obs_model_data_with_timestamps$T2Obs, main = "Observed temperature at 2 meters", xlab = "Observations", ylab ="T2Obs")

Data for analysis time 00UTC and forecast period 8

Below we can see the weather data for all seasons where the analysis time is 00:00 UTC, for the one-day forecast.

df_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1),]
df_00_08$timestamp = as.Date(strptime(df_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_00_08$timestamp, df_00_08$T2Obs, xlab="year", ylab="T2Obs")

Data for winter and analysis time 00UTC and forecast period 8

Below we can see the weather data for the winter season where the analysis time is 00:00 UTC, for the one-day forecast.

df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_08$timestamp = as.Date(strptime(df_winter_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_winter_00_08$timestamp, df_winter_00_08$T2Obs, xlab="year", ylab="T2Obs")

QQ plot of winter temp

This plot is interesting: the low temperatures clearly deviate from the reference line, so the winter observations are not normally distributed in the left tail.

qqnorm(df_winter_00_08$T2Obs); qqline(df_winter_00_08$T2Obs)

QQ plot of winter ECMWF model T2

We can observe that the distribution of T2 from the ECMWF model differs from that of T2Obs.

qqnorm(df_winter_00_08$T2); qqline(df_winter_00_08$T2)

Data for winter and analysis time 00UTC and forecast period 10 days

df_winter_00_65 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 65 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_65$timestamp = as.Date(strptime(df_winter_00_65$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_winter_00_65$timestamp, df_winter_00_65$T2Obs, xlab="year", ylab="T2Obs")

Data for summer and analysis time 00UTC and forecast period 10 days

df_summer_00_65 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 65 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 3),]
df_summer_00_65$timestamp = as.Date(strptime(df_summer_00_65$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_summer_00_65$timestamp, df_summer_00_65$T2Obs, xlab="year", ylab="T2Obs")

QQ plot of winter temp T2Obs for the 10-day forecast

qqnorm(df_winter_00_65$T2Obs); qqline(df_winter_00_65$T2Obs)

QQ plot of winter temp T2 from ECMWF model for the 10 day forecast

qqnorm(df_winter_00_65$T2); qqline(df_winter_00_65$T2)

Data for summer and analysis time 00UTC and forecast period 8

We also show similar graphs for the summer temperatures.

df_summer_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 3),]
df_summer_00_08$timestamp = as.Date(strptime(df_summer_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_summer_00_08$timestamp, df_summer_00_08$T2Obs, xlab="year", ylab="T2Obs")

QQ plot of summer observed T2Obs

qqnorm(df_summer_00_08$T2Obs); qqline(df_summer_00_08$T2Obs)

QQ plot of summer ECMWF model T2

qqnorm(df_summer_00_08$T2); qqline(df_summer_00_08$T2)

T2Obs vs T2 from ECMWF model for Winter forecast period 08

Now we plot some graphs depicting the relationship between T2Obs and T2 from ECMWF for winter, for the one-day forecast. We can observe that the relationship is more clearly linear at high temperatures than at low values of T2Obs and T2. We used the method lm to draw the regression line in this plot.

T2Obs <- df_winter_00_08$T2Obs
T2 <- df_winter_00_08$T2
ggplot(df_winter_00_08, aes(x = T2, y = T2Obs)) + geom_point() + xlab("ECMWF model T2") + ylab("T2Obs") + ggtitle("T2 observed vs ECMWF model T2 for Winter") + geom_smooth(method = "lm")

D2Obs vs D2 from ECMWF model for Winter forecast period 08

We can observe a similar kind of graph between D2Obs and D2 from ECMWF for winter for the one-day forecast.

D2Obs <- df_winter_00_08$D2Obs
D2 <- df_winter_00_08$D2
ggplot(df_winter_00_08, aes(x = D2, y = D2Obs)) + geom_point() + xlab("ECMWF model D2") + ylab("D2Obs") + ggtitle("D2 observed vs ECMWF model D2 for Winter") + geom_smooth(method = "lm")

T2Obs vs T2 from ECMWF model for Winter forecast period 65

We can observe that for the 10-day forecast the points of T2Obs and T2 from the ECMWF model are more scattered. A similar observation holds between D2Obs and D2 from the ECMWF model for the 10-day forecast.

T2Obs <- df_winter_00_65$T2Obs
T2 <- df_winter_00_65$T2
ggplot(df_winter_00_65, aes(x = T2, y = T2Obs)) + geom_point() + xlab("ECMWF model T2") + ylab("T2Obs") + ggtitle("T2 observed vs ECMWF model T2 for Winter") + geom_smooth(method = "lm")

D2Obs vs D2 from ECMWF model for Winter forecast period 65

D2Obs <- df_winter_00_65$D2Obs
D2 <- df_winter_00_65$D2
ggplot(df_winter_00_65, aes(x = D2, y = D2Obs)) + geom_point() + xlab("ECMWF model D2") + ylab("D2Obs") + ggtitle("D2 observed vs ECMWF model D2 for Winter") + geom_smooth(method = "lm")

T2Obs vs T2 from ECMWF model for Summer forecast period 08

The following plots show the relationship between T2Obs and T2, and between D2Obs and D2, from the ECMWF model for the one-day and 10-day summer forecasts.

T2Obs <- df_summer_00_08$T2Obs
T2 <- df_summer_00_08$T2
ggplot(df_summer_00_08, aes(x = T2, y = T2Obs)) + geom_point() + xlab("ECMWF model T2") + ylab("T2Obs") + ggtitle("T2 observed vs ECMWF model T2 for Summer") + geom_smooth(method = "lm")

D2Obs vs D2 from ECMWF model for Summer forecast period 08

D2Obs <- df_summer_00_08$D2Obs
D2 <- df_summer_00_08$D2
ggplot(df_summer_00_08, aes(x = D2, y = D2Obs)) + geom_point() + xlab("ECMWF model D2") + ylab("D2Obs") + ggtitle("D2 observed vs ECMWF model D2 for Summer") + geom_smooth(method = "lm")

T2Obs vs T2 from ECMWF for Summer forecast period 65

T2Obs <- df_summer_00_65$T2Obs
T2 <- df_summer_00_65$T2
ggplot(df_summer_00_65, aes(x = T2, y = T2Obs)) + geom_point() + xlab("ECMWF model T2") + ylab("T2Obs") + ggtitle("T2 observed vs ECMWF model T2 for Summer") + geom_smooth(method = "lm")

D2Obs vs D2 from ECMWF for Summer forecast period 65

D2Obs <- df_summer_00_65$D2Obs
D2 <- df_summer_00_65$D2
ggplot(df_summer_00_65, aes(x = D2, y = D2Obs)) + geom_point() + xlab("ECMWF model D2") + ylab("D2Obs") + ggtitle("D2 observed vs ECMWF model D2 for Summer") + geom_smooth(method = "lm")

The difference between T2Obs and T2 from the ECMWF model for all forecast periods and all seasons

Now we plot boxplots of the difference between T2Obs and T2 from the ECMWF model for all forecast periods. We can see that as the forecast period lengthens, the spread of the differences increases toward both high and low values.

T2Obs_T2 <- T2_D2_Obs_model_data_with_timestamps$T2Obs - T2_D2_Obs_model_data_with_timestamps$T2
df_T2Obs_T2 <- cbind(T2_D2_Obs_model_data_with_timestamps, T2Obs_T2)
colnames(df_T2Obs_T2)[ncol(df_T2Obs_T2)] <- "T2Obs_T2"
ggplot(df_T2Obs_T2, aes(x = factor(forecast_period), y = T2Obs_T2)) + xlab("forecast_period (2-65)") + ylab("T2Obs - T2") + ggtitle("Difference between the observed temperature and the temperature from the ECMWF model") + geom_boxplot(width = 1, colour = "blue")
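The growth of forecast error with lead time can also be summarized as one RMSE value per forecast period. Below is a sketch on synthetic data; the columns forecast_period, T2Obs and T2 mirror the real dataset, and the error magnitudes (1 K at period 8, 3 K at period 65) are invented for illustration.

```r
# Synthetic sketch: RMSE of the raw model T2 per forecast period.
set.seed(42)
toy <- data.frame(forecast_period = rep(c(8, 65), each = 50),
                  T2Obs           = rnorm(100, mean = 275, sd = 5))
# assume forecast error grows with lead time: sd 1 K at period 8, 3 K at period 65
toy$T2 <- toy$T2Obs + rnorm(100, sd = ifelse(toy$forecast_period == 8, 1, 3))
# group the squared errors by forecast period and take the root mean
rmse_by_period <- aggregate((T2Obs - T2)^2 ~ forecast_period, data = toy,
                            FUN = function(e) sqrt(mean(e)))
names(rmse_by_period)[2] <- "rmse"
rmse_by_period
```

With the real data the same aggregate call, applied to df_T2Obs_T2, gives a per-period RMSE table to accompany the boxplot.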


Development of a linear regression model for weather data

In order to model T2 and D2 we use linear regression. Linear regression is a modelling technique in which we find a linear relationship between a scalar dependent variable, called the response or predictand, and one or more explanatory variables, called the predictors.

Let us take a statistical dataset which consists of \((y, x_1, x_2, \dots, x_n)\), where \(y\) is the response or predictand variable and \(x_1, x_2, \dots, x_n\) are the predictors. We can model a linear relationship between the predictand and the predictors as

\(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\), where \(\hat{y}\) is the approximation to the predictand \(y\).

In the linear regression model our aim is to find the coefficients \(\beta_0, \beta_1, \dots, \beta_n\) so that the squared error \((y - \hat{y})^2\), summed over the observations, is minimized.
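The least-squares idea can be checked on a toy example: lm recovers coefficients close to the ones used to generate the data. The true values 1, 2 and -1 below are chosen arbitrarily for illustration.

```r
# Toy least-squares fit: lm() finds the coefficients minimizing the sum of squared errors.
set.seed(7)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(200, sd = 0.1)  # true betas: 1, 2, -1
fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates close to (1, 2, -1)
```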

In our assignment, we find the relationship between T2Obs and the output variables of the ECMWF numerical prediction model. A similar regression model relating D2Obs to the ECMWF outputs is also fitted.

The weather data vary widely between the seasons, and the ECMWF forecast quality depends strongly on the forecast period. For the analysis we therefore start with two datasets: the winter data and the summer data, each for the one-day forecast period. We predict T2 and D2 using our model on these data. In the later part of the assignment we apply the model to the whole Kumpula weather station dataset, covering 4 seasons and 64 forecast periods.

Load the data for one-day forecast for the winter season

We load the data from the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv” and select the data for the one-day forecast in the winter season, with analysis time 00:00 UTC. We then randomly split the data into training and test sets. The resulting dataset is named df_winter_00_08.

The dataset df_winter_00_08 has 573 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp; we call them collectively the station info. The 6th column is T2Obs and the 32nd column is D2Obs. Columns 7 to 31 contain the model data from the ECMWF numerical model.

# loading the data from the file 
T2_D2_Obs_model_data_with_timestamps = read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]

dim(df_winter_00_08)
## [1] 573  32
str(df_winter_00_08)
## 'data.frame':    573 obs. of  32 variables:
##  $ station_id     : int  2998 2998 2998 2998 2998 2998 2998 2998 2998 2998 ...
##  $ season         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ analysis_time  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ forecast_period: int  8 8 8 8 8 8 8 8 8 8 ...
##  $ timestamp      : Factor w/ 11184 levels "2011-12-01 03:00:00",..: 7 15 23 31 39 47 55 63 71 79 ...
##  $ T2Obs          : num  279 276 278 277 277 ...
##  $ MSL            : num  100384 99098 98962 97654 97876 ...
##  $ T2             : num  278 276 277 276 275 ...
##  $ D2             : num  276 275 275 275 271 ...
##  $ U10            : num  3 3.7 0.5 7 4.6 3.7 1.1 2.8 -2.3 6 ...
##  $ V10            : num  10.2 0.6 11.1 8 2.6 0.8 7.9 4 10.6 2.1 ...
##  $ TP             : num  0.03 0.2 0.07 0.1 0 0 0.47 0.17 0.5 0.17 ...
##  $ LCC            : num  0.05 0.02 0.86 0.39 0.12 0.15 0.51 0.55 0.62 0.8 ...
##  $ MCC            : num  0.99 1 0.88 0.08 0.27 0.98 0.91 0.4 0.69 0.71 ...
##  $ HCC            : num  1 0.95 0.92 0.76 0 0.98 0.97 0 0.32 0.59 ...
##  $ SKT            : num  278 276 277 277 275 ...
##  $ RH_950         : num  87.2 84.7 95.1 94 57.2 67.2 98.1 92.6 97.4 95.9 ...
##  $ T_950          : num  275 274 274 274 275 ...
##  $ RH_925         : num  72.2 82.2 99.7 93.4 56.1 ...
##  $ T_925          : num  274 273 272 273 274 ...
##  $ RH_850         : num  37.8 63 93.7 84.2 72.5 ...
##  $ T_850          : num  273 268 269 269 267 ...
##  $ RH_700         : num  82.7 77.2 76.2 60.7 26.6 19.8 61 19.9 97.1 84.5 ...
##  $ T_700          : num  262 258 260 258 257 ...
##  $ RH_500         : num  38.1 99.4 82 72.6 43.6 ...
##  $ T_500          : num  248 242 246 238 236 ...
##  $ T2_M1          : num  279 277 278 276 275 ...
##  $ T_950_M1       : num  275 274 275 274 273 ...
##  $ T_925_M1       : num  275 273 274 273 272 ...
##  $ DECLINATION    : num  -13.7 -14.1 -14.4 -14.7 -15 ...
##  $ D2_M1          : num  276 275 275 276 274 ...
##  $ D2Obs          : num  275 274 274 275 274 ...

Splitting the training and testing data.

From df_winter_00_08 we create two datasets: one with T2Obs and the model data, and one with D2Obs and the model data. The model data are the same in both, and the response variables T2Obs and D2Obs appear as the first column of their respective datasets. We then split each dataset into training and test data. The training data has 430 observations of 26 variables and the test data has 143 observations of 26 variables. Notice that the station info is removed, as those variables do not influence the linear regression model.

# Splitting into train and test data
  set.seed(123)
  n <- nrow(df_winter_00_08)
  train = sample(1:n, size = round(0.75*n), replace=FALSE)
  T2_winter_00_08_train <- df_winter_00_08[train,6:(ncol(df_winter_00_08) - 1)]
  T2_winter_00_08_test <-  df_winter_00_08[-train,6:(ncol(df_winter_00_08) - 1)] 
  D2_winter_00_08_train <- cbind(df_winter_00_08[train,ncol(df_winter_00_08)],
                                 df_winter_00_08[train,7:(ncol(df_winter_00_08) - 1)])
  colnames(D2_winter_00_08_train)[1] <- "D2Obs"
  D2_winter_00_08_test <- cbind(df_winter_00_08[-train,ncol(df_winter_00_08)],
                                df_winter_00_08[-train,7:(ncol(df_winter_00_08) - 1)])
  colnames(D2_winter_00_08_test)[1] <- "D2Obs"
  
  dim(T2_winter_00_08_train)
## [1] 430  26
  dim(T2_winter_00_08_test)
## [1] 143  26
  dim(D2_winter_00_08_train)
## [1] 430  26
  dim(D2_winter_00_08_test)
## [1] 143  26
  str(T2_winter_00_08_train)
## 'data.frame':    430 obs. of  26 variables:
##  $ T2Obs      : num  270 274 266 264 274 ...
##  $ MSL        : num  102001 103628 102918 102871 99973 ...
##  $ T2         : num  271 274 267 269 274 ...
##  $ D2         : num  268 272 264 266 272 ...
##  $ U10        : num  -2.6 2.6 -1.2 -6.2 4.7 3.7 6.2 2.7 4.8 1.5 ...
##  $ V10        : num  6.1 0.5 -3 1.1 5.3 1.4 0.5 3.5 1.9 -3 ...
##  $ TP         : num  0.1 0 0 0.03 0.2 0 0 0.17 0.1 0 ...
##  $ LCC        : num  0.99 0.69 1 0.93 0.58 0 0 0.52 1 0 ...
##  $ MCC        : num  0.74 0.94 0 0.55 0.59 0.78 0.47 0.29 0.47 0 ...
##  $ HCC        : num  0 0 0 0.94 0 0 0 0 0 0 ...
##  $ SKT        : num  272 274 266 270 273 ...
##  $ RH_950     : num  94.8 95.9 95.3 100.7 99 ...
##  $ T_950      : num  266 270 263 264 272 ...
##  $ RH_925     : num  100.5 95.7 98 73.8 100.6 ...
##  $ T_925      : num  264 269 261 265 270 ...
##  $ RH_850     : num  98.8 90.8 63.2 23.5 91.9 36.5 26.9 83.3 66.8 32.7 ...
##  $ T_850      : num  261 266 261 266 267 ...
##  $ RH_700     : num  104.4 54 76.6 12.8 90.1 ...
##  $ T_700      : num  254 259 256 258 256 ...
##  $ RH_500     : num  38 23 26.8 96 64.5 85.4 5.3 19 90.9 35.9 ...
##  $ T_500      : num  238 244 241 242 236 ...
##  $ T2_M1      : num  270 274 267 269 274 ...
##  $ T_950_M1   : num  265 271 262 264 272 ...
##  $ T_925_M1   : num  263 269 260 264 270 ...
##  $ DECLINATION: num  -17.7 -12.7 -20.4 -23.4 -19.9 ...
##  $ D2_M1      : num  266 272 265 266 273 ...
  str(D2_winter_00_08_train)
## 'data.frame':    430 obs. of  26 variables:
##  $ D2Obs      : num  266 273 265 262 274 ...
##  $ MSL        : num  102001 103628 102918 102871 99973 ...
##  $ T2         : num  271 274 267 269 274 ...
##  $ D2         : num  268 272 264 266 272 ...
##  $ U10        : num  -2.6 2.6 -1.2 -6.2 4.7 3.7 6.2 2.7 4.8 1.5 ...
##  $ V10        : num  6.1 0.5 -3 1.1 5.3 1.4 0.5 3.5 1.9 -3 ...
##  $ TP         : num  0.1 0 0 0.03 0.2 0 0 0.17 0.1 0 ...
##  $ LCC        : num  0.99 0.69 1 0.93 0.58 0 0 0.52 1 0 ...
##  $ MCC        : num  0.74 0.94 0 0.55 0.59 0.78 0.47 0.29 0.47 0 ...
##  $ HCC        : num  0 0 0 0.94 0 0 0 0 0 0 ...
##  $ SKT        : num  272 274 266 270 273 ...
##  $ RH_950     : num  94.8 95.9 95.3 100.7 99 ...
##  $ T_950      : num  266 270 263 264 272 ...
##  $ RH_925     : num  100.5 95.7 98 73.8 100.6 ...
##  $ T_925      : num  264 269 261 265 270 ...
##  $ RH_850     : num  98.8 90.8 63.2 23.5 91.9 36.5 26.9 83.3 66.8 32.7 ...
##  $ T_850      : num  261 266 261 266 267 ...
##  $ RH_700     : num  104.4 54 76.6 12.8 90.1 ...
##  $ T_700      : num  254 259 256 258 256 ...
##  $ RH_500     : num  38 23 26.8 96 64.5 85.4 5.3 19 90.9 35.9 ...
##  $ T_500      : num  238 244 241 242 236 ...
##  $ T2_M1      : num  270 274 267 269 274 ...
##  $ T_950_M1   : num  265 271 262 264 272 ...
##  $ T_925_M1   : num  263 269 260 264 270 ...
##  $ DECLINATION: num  -17.7 -12.7 -20.4 -23.4 -19.9 ...
##  $ D2_M1      : num  266 272 265 266 273 ...
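The train/test split above is used in the standard way: fit on the training rows, predict the held-out rows, and score with RMSE. Below is a minimal self-contained sketch of that loop on synthetic data; a single assumed predictor T2 stands in for the 25 model variables, and the coefficients 0.5 and 0.98 are invented.

```r
# Sketch of hold-out evaluation: fit on train, score RMSE on the test set.
set.seed(123)
syn <- data.frame(T2 = rnorm(200, mean = 273, sd = 5))
syn$T2Obs <- 0.5 + 0.98 * syn$T2 + rnorm(200, sd = 1)  # assumed linear relation
idx   <- sample(seq_len(nrow(syn)), size = round(0.75 * nrow(syn)))
train <- syn[idx, ]
test  <- syn[-idx, ]
fit   <- lm(T2Obs ~ T2, data = train)
rmse  <- sqrt(mean((test$T2Obs - predict(fit, newdata = test))^2))
rmse
```

With the real data the same pattern applies, using lm(T2Obs ~ ., data = T2_winter_00_08_train) and predicting on T2_winter_00_08_test.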

Correlation matrix

Here we compute the correlation matrix for the variables of the dataset df_winter_00_08.

library(corrplot); library(dplyr)
cor(as.matrix(df_winter_00_08[,6:ncol(df_winter_00_08)])) 
##                   T2Obs          MSL          T2         D2         U10
## T2Obs        1.00000000 -0.413811966  0.94819952  0.9283360  0.39888861
## MSL         -0.41381197  1.000000000 -0.42317943 -0.4285554 -0.29288672
## T2           0.94819952 -0.423179432  1.00000000  0.9637727  0.34185343
## D2           0.92833604 -0.428555359  0.96377269  1.0000000  0.34243224
## U10          0.39888861 -0.292886725  0.34185343  0.3424322  1.00000000
## V10          0.40988014 -0.139920196  0.43500360  0.4683110  0.15677157
## TP           0.12437463 -0.310715188  0.19148969  0.2329119 -0.17057429
## LCC         -0.03288617 -0.072495229  0.09124882  0.1843243 -0.24070874
## MCC          0.14525132 -0.307142732  0.20605611  0.2101621 -0.12633130
## HCC          0.23019018 -0.239862584  0.26314762  0.2699033 -0.07444964
## SKT          0.88635334 -0.416393861  0.96003024  0.9324701  0.26090652
## RH_950       0.18680122 -0.332949329  0.27585845  0.3713458 -0.05612313
## T_950        0.88894005 -0.400354657  0.86378443  0.8619767  0.47067278
## RH_925       0.21611920 -0.405869358  0.30219819  0.3676704 -0.05547312
## T_925        0.85305415 -0.350232568  0.82765673  0.8354953  0.45719207
## RH_850       0.16796347 -0.399022913  0.22836550  0.2647076 -0.06964610
## T_850        0.76939908 -0.218716663  0.76410634  0.7756818  0.35151703
## RH_700       0.04118724 -0.293594088  0.09996808  0.1134363 -0.07876769
## T_700        0.67602135 -0.070510714  0.68220320  0.6821273  0.21442493
## RH_500       0.09403511 -0.245685570  0.13023430  0.1367390 -0.07298243
## T_500        0.58733485  0.004315872  0.59148301  0.5743458  0.17941813
## T2_M1        0.94419536 -0.427028839  0.97794187  0.9393936  0.35542012
## T_950_M1     0.88398440 -0.411611039  0.85505911  0.8462214  0.46579184
## T_925_M1     0.85172114 -0.366731359  0.82271175  0.8228920  0.46300772
## DECLINATION  0.45237252 -0.029596156  0.44224781  0.3801482  0.08407667
## D2_M1        0.92448408 -0.440432987  0.95124248  0.9799598  0.35241699
## D2Obs        0.94193317 -0.455461577  0.93054344  0.9664760  0.38240272
##                     V10          TP         LCC          MCC         HCC
## T2Obs        0.40988014  0.12437463 -0.03288617  0.145251318  0.23019018
## MSL         -0.13992020 -0.31071519 -0.07249523 -0.307142732 -0.23986258
## T2           0.43500360  0.19148969  0.09124882  0.206056113  0.26314762
## D2           0.46831103  0.23291194  0.18432430  0.210162104  0.26990327
## U10          0.15677157 -0.17057429 -0.24070874 -0.126331297 -0.07444964
## V10          1.00000000  0.20778973  0.24391282  0.236058803  0.32569156
## TP           0.20778973  1.00000000  0.31979717  0.458878396  0.24949343
## LCC          0.24391282  0.31979717  1.00000000  0.254180818  0.16403613
## MCC          0.23605880  0.45887840  0.25418082  1.000000000  0.48023157
## HCC          0.32569156  0.24949343  0.16403613  0.480231570  1.00000000
## SKT          0.42663298  0.23951905  0.22575965  0.244730758  0.27416486
## RH_950       0.30877422  0.31945589  0.69139249  0.241499869  0.19327126
## T_950        0.30829153  0.10720762 -0.16829073  0.094443642  0.19841297
## RH_925       0.26975527  0.35831048  0.65505308  0.297044509  0.21018173
## T_925        0.31259035  0.09297324 -0.16427212  0.068320297  0.19325203
## RH_850       0.16899532  0.43744112  0.49746405  0.467002703  0.16251084
## T_850        0.39728664  0.08222362 -0.03616471  0.007852306  0.24965827
## RH_700       0.10825689  0.41840813  0.29728990  0.608493452  0.22022954
## T_700        0.41755364  0.09861702  0.05299987  0.036553298  0.28319161
## RH_500       0.17528982  0.28480045  0.13914094  0.616356463  0.51377097
## T_500        0.36518165  0.08031987  0.01984444  0.065538841  0.32310207
## T2_M1        0.36028397  0.16101713  0.02152608  0.167021367  0.23265653
## T_950_M1     0.24745279  0.08428835 -0.18212499  0.077195419  0.17638084
## T_925_M1     0.26073294  0.07149240 -0.18056386  0.052828052  0.17352946
## DECLINATION -0.03213909  0.03771255 -0.21109628 -0.021729648  0.05668091
## D2_M1        0.40001171  0.19830101  0.13464191  0.181090800  0.24352796
## D2Obs        0.47678192  0.20345976  0.14652865  0.195814995  0.25890049
##                    SKT      RH_950       T_950       RH_925        T_925
## T2Obs        0.8863533  0.18680122  0.88894005  0.216119199  0.853054151
## MSL         -0.4163939 -0.33294933 -0.40035466 -0.405869358 -0.350232568
## T2           0.9600302  0.27585845  0.86378443  0.302198191  0.827656727
## D2           0.9324701  0.37134581  0.86197674  0.367670428  0.835495250
## U10          0.2609065 -0.05612313  0.47067278 -0.055473121  0.457192073
## V10          0.4266330  0.30877422  0.30829153  0.269755270  0.312590355
## TP           0.2395191  0.31945589  0.10720762  0.358310478  0.092973241
## LCC          0.2257597  0.69139249 -0.16829073  0.655053077 -0.164272121
## MCC          0.2447308  0.24149987  0.09444364  0.297044509  0.068320297
## HCC          0.2741649  0.19327126  0.19841297  0.210181730  0.193252028
## SKT          1.0000000  0.38351744  0.76362561  0.381873995  0.734760823
## RH_950       0.3835174  1.00000000 -0.04652156  0.906652360 -0.071677023
## T_950        0.7636256 -0.04652156  1.00000000  0.017985727  0.977395344
## RH_925       0.3818740  0.90665236  0.01798573  1.000000000 -0.064795061
## T_925        0.7347608 -0.07167702  0.97739534 -0.064795061  1.000000000
## RH_850       0.2783907  0.55757484  0.05592327  0.698828040 -0.001843391
## T_850        0.7021015  0.03952116  0.84295487  0.001892902  0.890126622
## RH_700       0.1633512  0.34207418 -0.02901140  0.399889400 -0.061724744
## T_700        0.6400933  0.07562355  0.70508318  0.048522955  0.747717931
## RH_500       0.1528910  0.15485791  0.08770309  0.171523683  0.077657062
## T_500        0.5513347  0.02007681  0.60901726  0.004119243  0.641986519
## T2_M1        0.9371965  0.22590702  0.86236404  0.248814962  0.827904282
## T_950_M1     0.7587190 -0.05353429  0.98032614  0.007487770  0.956520188
## T_925_M1     0.7308665 -0.07722763  0.96649309 -0.062380455  0.979534164
## DECLINATION  0.4122035 -0.09919816  0.44368980 -0.065279659  0.417626106
## D2_M1        0.9224356  0.33473479  0.85860018  0.330967322  0.834474987
## D2Obs        0.8969396  0.36156717  0.85307293  0.361084178  0.826701929
##                   RH_850        T_850      RH_700       T_700      RH_500
## T2Obs        0.167963469  0.769399085  0.04118724  0.67602135  0.09403511
## MSL         -0.399022913 -0.218716663 -0.29359409 -0.07051071 -0.24568557
## T2           0.228365501  0.764106337  0.09996808  0.68220320  0.13023430
## D2           0.264707565  0.775681835  0.11343631  0.68212734  0.13673898
## U10         -0.069646096  0.351517029 -0.07876769  0.21442493 -0.07298243
## V10          0.168995321  0.397286638  0.10825689  0.41755364  0.17528982
## TP           0.437441125  0.082223619  0.41840813  0.09861702  0.28480045
## LCC          0.497464051 -0.036164710  0.29728990  0.05299987  0.13914094
## MCC          0.467002703  0.007852306  0.60849345  0.03655330  0.61635646
## HCC          0.162510836  0.249658274  0.22022954  0.28319161  0.51377097
## SKT          0.278390678  0.702101545  0.16335123  0.64009330  0.15289102
## RH_950       0.557574843  0.039521157  0.34207418  0.07562355  0.15485791
## T_950        0.055923274  0.842954866 -0.02901140  0.70508318  0.08770309
## RH_925       0.698828040  0.001892902  0.39988940  0.04852296  0.17152368
## T_925       -0.001843391  0.890126622 -0.06172474  0.74771793  0.07765706
## RH_850       1.000000000 -0.118710690  0.59675402 -0.05735586  0.24044789
## T_850       -0.118710690  1.000000000 -0.10183030  0.89325708  0.04000299
## RH_700       0.596754017 -0.101830300  1.00000000 -0.11862092  0.41038018
## T_700       -0.057355860  0.893257084 -0.11862092  1.00000000  0.05992118
## RH_500       0.240447893  0.040002992  0.41038018  0.05992118  1.00000000
## T_500       -0.075711362  0.770578774 -0.09553026  0.89912143  0.03878061
## T2_M1        0.180628282  0.749989737  0.06321443  0.65443338  0.10908679
## T_950_M1     0.029034251  0.819861647 -0.03979655  0.67013543  0.07289999
## T_925_M1    -0.032145229  0.871372768 -0.07769362  0.72071438  0.06414881
## DECLINATION -0.013263616  0.338130424 -0.07407492  0.32886792 -0.01122715
## D2_M1        0.226791008  0.765859911  0.08582535  0.66284002  0.11957414
## D2Obs        0.268052174  0.765794084  0.10873350  0.67134133  0.13140822
##                    T_500       T2_M1    T_950_M1    T_925_M1 DECLINATION
## T2Obs        0.587334852  0.94419536  0.88398440  0.85172114  0.45237252
## MSL          0.004315872 -0.42702884 -0.41161104 -0.36673136 -0.02959616
## T2           0.591483005  0.97794187  0.85505911  0.82271175  0.44224781
## D2           0.574345804  0.93939358  0.84622140  0.82289201  0.38014817
## U10          0.179418126  0.35542012  0.46579184  0.46300772  0.08407667
## V10          0.365181649  0.36028397  0.24745279  0.26073294 -0.03213909
## TP           0.080319866  0.16101713  0.08428835  0.07149240  0.03771255
## LCC          0.019844435  0.02152608 -0.18212499 -0.18056386 -0.21109628
## MCC          0.065538841  0.16702137  0.07719542  0.05282805 -0.02172965
## HCC          0.323102075  0.23265653  0.17638084  0.17352946  0.05668091
## SKT          0.551334698  0.93719654  0.75871903  0.73086652  0.41220355
## RH_950       0.020076813  0.22590702 -0.05353429 -0.07722763 -0.09919816
## T_950        0.609017256  0.86236404  0.98032614  0.96649309  0.44368980
## RH_925       0.004119243  0.24881496  0.00748777 -0.06238045 -0.06527966
## T_925        0.641986519  0.82790428  0.95652019  0.97953416  0.41762611
## RH_850      -0.075711362  0.18062828  0.02903425 -0.03214523 -0.01326362
## T_850        0.770578774  0.74998974  0.81986165  0.87137277  0.33813042
## RH_700      -0.095530263  0.06321443 -0.03979655 -0.07769362 -0.07407492
## T_700        0.899121431  0.65443338  0.67013543  0.72071438  0.32886792
## RH_500       0.038780606  0.10908679  0.07289999  0.06414881 -0.01122715
## T_500        1.000000000  0.56299998  0.57860779  0.61653798  0.33501718
## T2_M1        0.562999978  1.00000000  0.87358417  0.83523137  0.46319083
## T_950_M1     0.578607789  0.87358417  1.00000000  0.97512160  0.45672756
## T_925_M1     0.616537983  0.83523137  0.97512160  1.00000000  0.42995698
## DECLINATION  0.335017182  0.46319083  0.45672756  0.42995698  1.00000000
## D2_M1        0.552458828  0.95904366  0.86190071  0.83612281  0.38729962
## D2Obs        0.564374564  0.90946289  0.83419344  0.81248000  0.33219331
##                   D2_M1      D2Obs
## T2Obs        0.92448408  0.9419332
## MSL         -0.44043299 -0.4554616
## T2           0.95124248  0.9305434
## D2           0.97995979  0.9664760
## U10          0.35241699  0.3824027
## V10          0.40001171  0.4767819
## TP           0.19830101  0.2034598
## LCC          0.13464191  0.1465286
## MCC          0.18109080  0.1958150
## HCC          0.24352796  0.2589005
## SKT          0.92243558  0.8969396
## RH_950       0.33473479  0.3615672
## T_950        0.85860018  0.8530729
## RH_925       0.33096732  0.3610842
## T_925        0.83447499  0.8267019
## RH_850       0.22679101  0.2680522
## T_850        0.76585991  0.7657941
## RH_700       0.08582535  0.1087335
## T_700        0.66284002  0.6713413
## RH_500       0.11957414  0.1314082
## T_500        0.55245883  0.5643746
## T2_M1        0.95904366  0.9094629
## T_950_M1     0.86190071  0.8341934
## T_925_M1     0.83612281  0.8124800
## DECLINATION  0.38729962  0.3321933
## D2_M1        1.00000000  0.9529205
## D2Obs        0.95292046  1.0000000

Linear regression model for T2

We use the correlation matrix and the scatter plots to select the predictor variables for our model for T2.

Correlation coefficients and scatter plots of the variables of the dataset df_winter_00_08 with T2Obs

We study and plot the relationship between T2Obs and the ECMWF model variables. The correlation coefficients of the variables of the dataset df_winter_00_08 with T2Obs are given below.


                T2Obs          MSL          T2         D2         U10         V10          TP         LCC
T2Obs       1.00000000 -0.413811966  0.94819952  0.9283360  0.39888861  0.40988014  0.12437463 -0.03288617
                MCC         HCC        SKT      RH_950       T_950       RH_925        T_925       RH_850
T2Obs        0.145251318  0.23019018  0.8863533  0.18680122  0.88894005  0.216119199  0.853054151  0.167963469
               T_850      RH_700       T_700      RH_500        T_500       T2_M1    T_950_M1    T_925_M1
T2Obs        0.769399085  0.04118724  0.67602135  0.09403511  0.587334852  0.94419536  0.88398440  0.85172114
            DECLINATION       D2_M1      D2Obs
T2Obs        0.45237252  0.92448408  0.9419332
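The coefficients above are one row of the full correlation matrix; a minimal sketch of how they can be extracted directly, assuming the column layout used in the plotting loop below (T2Obs in column 6, the candidate predictors in columns 7:31):

```r
# Correlation of each candidate predictor (columns 7:31) with T2Obs (column 6)
round(cor(df_winter_00_08[, 7:31], df_winter_00_08[, 6],
          use = "complete.obs"), 3)
```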

We can observe from the scatter plots that all “temperature variables” are linearly correlated with T2Obs. Zoom into the figure to see the details.

library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  # scatter plot of each candidate predictor against the observed T2
  plot(df_winter_00_08[, i], df_winter_00_08[, 6],
       ylab = "T2Obs", xlab = colnames(df_winter_00_08)[i])
}

First linear regression model for T2

For our first model, we take all the variables which have a positive correlation higher than 0.8 with T2Obs and all the variables which are negatively correlated with T2Obs.

All the “temperature variables” are linearly correlated with T2Obs, but some have correlation coefficients below 0.8. We keep only the temperature variables whose correlation coefficients with T2Obs exceed 0.8, so we use T2, T_950, T_925, T2_M1, T_950_M1, T_925_M1 and SKT in our model. D2 and D2_M1 also have high correlation coefficients with T2Obs and are linearly related to it, so we include them in our model as well.

The RH variables appear randomly distributed against T2Obs and have correlations below 0.8, so we exclude them from our model. Although DECLINATION is linearly correlated with T2Obs, its correlation coefficient is below 0.5, so we exclude it as well. TP is not linearly correlated with T2Obs and has a very small correlation coefficient, so we exclude it from the model.

From the scatter plots we can see that the cloud cover variables LCC, MCC and HCC are randomly distributed against T2Obs and have low correlation coefficients with it. As LCC has a negative correlation with T2Obs, we add LCC to our list of predictors, and as MSL also has a negative correlation coefficient, we include it as well. The wind velocity variables U10 and V10 are randomly distributed and have low correlation coefficients, so we exclude them from our model.

Based on the above reasoning, the formula for our initial model is ** T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL **

We use the R function “lm” to fit the linear regression model.

The summary of our first model, my.model1.T2, shows a residual standard error of 1.428, an F-statistic of 506.9 and a p-value < 2.2e-16. The p-value gives the probability of obtaining data at least as extreme as ours if the null hypothesis were true. Under the null hypothesis the model is intercept-only and there is no linear relationship between the predictand and the predictors. As our p-value is very small, we reject the null hypothesis and conclude that there is a linear relationship between the predictand and the predictors.

The Fisher value, or F-statistic, is the ratio of the explained variance to the unexplained variance and has a value near 1 if the null hypothesis is true. It is defined as F-statistic = MSM (Mean of Squares for Model) / MSE (Mean of Squares for Error). In our case the F-statistic is 506.9, so our model clearly rejects the null hypothesis.
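Written out for a model with k predictors fitted to n observations, this ratio is (standard regression theory, stated here for reference):

```latex
F = \frac{\mathrm{MSM}}{\mathrm{MSE}}
  = \frac{(\mathrm{TSS} - \mathrm{RSS})/k}{\mathrm{RSS}/(n - k - 1)}
```

where TSS is the total sum of squares and RSS the residual sum of squares; the degrees of freedom k and n - k - 1 are the “11 and 418 DF” quoted in the summary of the first model.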

The t value is the value of the t-statistic for testing whether the corresponding regression coefficient differs from 0. As a rule of thumb, a t value with magnitude around 2 or greater is good.

The Pr(>|t|) value is the p-value of the hypothesis test for which the t value is the test statistic. It gives the probability of obtaining a test statistic at least as extreme as the observed t value if the null hypothesis were true. A low Pr(>|t|) value for a variable indicates a linear relationship between that variable and the predictand. The significance codes order the variables by their p-values: the lower the p-value, the higher the significance.
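For each coefficient, the t value in the summary table is simply the estimate divided by its standard error (standard linear-model output; the subscript j indexes the predictors):

```latex
t_j = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}
```

For example, for D2 in the first model the coefficient table gives $0.4895 / 0.1016 \approx 4.82$, the printed t value.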

Based on the F-statistic and p value we see that our model is a good one.

my.model1.T2 <- lm(T2Obs  ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL , data = T2_winter_00_08_train)
summary(my.model1.T2)
## 
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + 
##     T_925_M1 + D2_M1 + SKT + LCC + MSL, data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3744 -0.5531  0.1222  0.7663  4.5287 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.695e+00  8.465e+00   0.909  0.36387    
## T2           2.712e-01  1.182e-01   2.294  0.02231 *  
## D2           4.895e-01  1.016e-01   4.820 2.02e-06 ***
## T_950        1.229e-01  1.264e-01   0.973  0.33129    
## T_925       -1.227e-01  1.084e-01  -1.131  0.25851    
## T2_M1        2.856e-01  1.061e-01   2.692  0.00738 ** 
## T_950_M1     5.766e-02  1.175e-01   0.491  0.62388    
## T_925_M1     6.014e-02  1.087e-01   0.553  0.58044    
## D2_M1       -1.460e-01  9.609e-02  -1.520  0.12932    
## SKT         -4.030e-02  5.794e-02  -0.695  0.48716    
## LCC         -1.369e+00  2.582e-01  -5.304 1.84e-07 ***
## MSL          4.347e-06  5.328e-05   0.082  0.93501    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.428 on 418 degrees of freedom
## Multiple R-squared:  0.9303, Adjusted R-squared:  0.9284 
## F-statistic: 506.9 on 11 and 418 DF,  p-value: < 2.2e-16

Second linear regression model for T2

In this step we try to see whether we can improve our first model. Based on the t values and Pr(>|t|) values obtained in our first model, we keep only the variables with Pr(>|t|) below 0.5. This results in a new formula for the linear regression:

** T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC **

The summary of the model shows that the F-statistic has increased from 506.9 to 698.6. The p-value remains at the same low value, and the residual standard error is almost the same (1.428 for the first model, 1.426 for the second).
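This screening by Pr(>|t|) can also be done programmatically; a minimal sketch, assuming my.model1.T2 has been fitted as above:

```r
# List the predictors of my.model1.T2 whose Pr(>|t|) is below 0.5
coefs <- summary(my.model1.T2)$coefficients
keep  <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.5]
setdiff(keep, "(Intercept)")
```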

my.model2.T2 <- lm(T2Obs  ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model2.T2)
## 
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + 
##     SKT + LCC, data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2392 -0.5785  0.1247  0.8006  4.6234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.07884    4.49191   2.021   0.0439 *  
## T2           0.26398    0.11688   2.259   0.0244 *  
## D2           0.45513    0.09786   4.651 4.43e-06 ***
## T_950        0.18812    0.08061   2.334   0.0201 *  
## T_925       -0.07796    0.06756  -1.154   0.2492    
## T2_M1        0.30452    0.10092   3.017   0.0027 ** 
## D2_M1       -0.11617    0.09282  -1.252   0.2114    
## SKT         -0.04323    0.05760  -0.750   0.4534    
## LCC         -1.35913    0.25611  -5.307 1.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.426 on 421 degrees of freedom
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.9286 
## F-statistic: 698.6 on 8 and 421 DF,  p-value: < 2.2e-16

Third linear regression model for T2

We try to see if we can improve the model still further. Based on the t values and Pr(>|t|) values obtained in our second model, we keep only the variables with Pr(>|t|) below 0.1. This results in a new formula for the linear regression:

** T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC **

The summary of the model shows that the F-statistic has increased further, to 1114. The p-value remains at the same low value, and the residual standard error is the same as for the first model.

my.model3.T2 <- lm(T2Obs  ~ T2 + D2 + T_950 + T2_M1 +  LCC  , data = T2_winter_00_08_train)
summary(my.model3.T2)
## 
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC, data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4897 -0.6502  0.1423  0.7904  4.4750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.05288    4.24875   1.895  0.05873 .  
## T2           0.32245    0.08019   4.021 6.85e-05 ***
## D2           0.34431    0.05699   6.042 3.34e-09 ***
## T_950        0.11709    0.03781   3.097  0.00209 ** 
## T2_M1        0.19490    0.06446   3.024  0.00265 ** 
## LCC         -1.43573    0.24820  -5.785 1.41e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.428 on 424 degrees of freedom
## Multiple R-squared:  0.9293, Adjusted R-squared:  0.9284 
## F-statistic:  1114 on 5 and 424 DF,  p-value: < 2.2e-16

The selected linear model for T2

From the above analysis we saw that we do not gain much in terms of residual standard error and p-value when we move from model 1 to model 3. Since our analysis is based only on the winter data, we cannot strongly claim that the five variables of the third model are the only ones that significantly affect T2. As a compromise we select the second model, with eight variables, which has a higher F value than the first model. The linear regression formula for our model is taken as

** T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC **

my.model.T2 <- lm(T2Obs  ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model.T2)
## 
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + 
##     SKT + LCC, data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2392 -0.5785  0.1247  0.8006  4.6234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.07884    4.49191   2.021   0.0439 *  
## T2           0.26398    0.11688   2.259   0.0244 *  
## D2           0.45513    0.09786   4.651 4.43e-06 ***
## T_950        0.18812    0.08061   2.334   0.0201 *  
## T_925       -0.07796    0.06756  -1.154   0.2492    
## T2_M1        0.30452    0.10092   3.017   0.0027 ** 
## D2_M1       -0.11617    0.09282  -1.252   0.2114    
## SKT         -0.04323    0.05760  -0.750   0.4534    
## LCC         -1.35913    0.25611  -5.307 1.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.426 on 421 degrees of freedom
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.9286 
## F-statistic: 698.6 on 8 and 421 DF,  p-value: < 2.2e-16

Predicting T2 with my model

We now test the predicting ability of our model by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.462, slightly larger than the residual standard error on the training data. Our linear regression model for T2 is

T2 = 9.07883695 + 0.26398396 * T2 (of the ECMWF model) + 0.45513159 * D2 (of the ECMWF model) + 0.18812205 * T_950 - 0.07796135 * T_925 + 0.30452258 * T2_M1 - 0.11617459 * D2_M1 - 0.04322546 * SKT - 1.35912603 * LCC
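The rmse quoted here is computed from the test-set residuals in the usual way, with $y_i$ the observed values and $\hat{y}_i$ the model predictions:

```latex
\mathrm{rmse} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```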
# calculating the fit
  fitted.values <- predict(object = my.model.T2, newdata = T2_winter_00_08_test)
  T2pred.mymodel <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
  coefficients <- my.model.T2$coefficients
  rmse.mymodel.T2 <- sqrt(mean(residuals^2))
  coefficients
## (Intercept)          T2          D2       T_950       T_925       T2_M1 
##  9.07883695  0.26398396  0.45513159  0.18812205 -0.07796135  0.30452258 
##       D2_M1         SKT         LCC 
## -0.11617459 -0.04322546 -1.35912603
  rmse.mymodel.T2 
## [1] 1.462304

Linear regression model for D2

We use the correlation matrix and the scatter plots to select the predictor variables for our model for D2.

Correlation coefficients and scatter plots of the variables of the dataset df_winter_00_08 with D2Obs

We study and plot the relationship between D2Obs and the ECMWF model variables. The correlation coefficients of the variables of the dataset df_winter_00_08 with D2Obs are given below.


                T2Obs          MSL          T2         D2         U10         V10          TP         LCC
D2Obs        0.94193317 -0.455461577  0.93054344  0.9664760  0.38240272  0.47678192  0.20345976  0.14652865
                MCC         HCC        SKT      RH_950       T_950       RH_925        T_925       RH_850
D2Obs        0.195814995  0.25890049  0.8969396  0.36156717  0.85307293  0.361084178  0.826701929  0.268052174
               T_850      RH_700       T_700      RH_500        T_500       T2_M1    T_950_M1    T_925_M1
D2Obs        0.765794084  0.10873350  0.67134133  0.13140822  0.564374564  0.90946289  0.83419344  0.81248000
               DECLINATION   D2_M1      D2Obs
D2Obs        0.33219331  0.95292046  1.0000000

We can observe from the scatter plots that all “temperature variables” are linearly correlated with D2Obs. Zoom into the figure to see the details.

library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  # scatter plot of each candidate predictor against the observed D2
  plot(df_winter_00_08[, i], df_winter_00_08[, 32],
       ylab = "D2Obs", xlab = colnames(df_winter_00_08)[i])
}

First linear regression model for D2

Similar to the modelling of T2Obs, for our first model, we take all the variables which have a positive correlation higher than 0.8 with D2Obs and all the variables which are negatively correlated with D2Obs.

All the “temperature variables” are linearly correlated with D2Obs, but some have correlation coefficients below 0.8. We keep only the temperature variables whose correlation coefficients with D2Obs exceed 0.8, so we use T2, T_950, T_925, T2_M1, T_950_M1, T_925_M1 and SKT in our model. D2 and D2_M1 also have high correlation coefficients with D2Obs and are linearly related to it, so we include them in our model as well.

The RH variables appear randomly distributed against D2Obs and have correlations below 0.8, so we exclude them from our model. Although DECLINATION is linearly correlated with D2Obs, its correlation coefficient is below 0.5, so we exclude it as well.

The cloud cover variables LCC, MCC and HCC are randomly distributed against D2Obs and have low correlation coefficients with it, so we exclude them from our model. As MSL has a negative correlation coefficient, we include it in our list of predictors. The wind velocity variables U10 and V10 are randomly distributed and have low correlation coefficients.

From a layman’s point of view, however, we have a feeling that D2 may be related to LCC, TP, RH_950 and V10, so we add them to the list of predictors even though they are not linearly correlated with D2Obs and have very small correlation coefficients.

Based on the above reasoning, the formula for our initial model is

** D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC **

The summary of our first model, my_model1_D2, shows a residual standard error of 1.447, an F-statistic of 486.7 and a p-value < 2.2e-16. As the p-value is very small, we reject the null hypothesis: there is a linear relationship between the predictand and the predictors.

my_model1_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model1_D2)
## 
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + 
##     T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, 
##     data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9569 -0.7827  0.0145  0.8490  5.1294 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.160e+01  9.767e+00   1.188 0.235631    
## T2          -4.629e-02  1.216e-01  -0.381 0.703674    
## D2           5.465e-01  1.133e-01   4.824 1.98e-06 ***
## T_950        3.214e-01  1.297e-01   2.478 0.013608 *  
## T_925        3.868e-03  1.104e-01   0.035 0.972074    
## T2_M1       -1.111e-01  1.079e-01  -1.029 0.304000    
## T_950_M1    -1.584e-01  1.201e-01  -1.319 0.187916    
## T_925_M1     4.409e-02  1.102e-01   0.400 0.689370    
## D2_M1        3.059e-01  9.887e-02   3.094 0.002109 ** 
## SKT          7.233e-02  5.931e-02   1.219 0.223357    
## MSL         -7.530e-05  6.167e-05  -1.221 0.222782    
## TP          -7.737e-01  4.781e-01  -1.618 0.106377    
## RH_950       3.217e-02  7.219e-03   4.456 1.07e-05 ***
## V10          8.348e-02  2.304e-02   3.622 0.000328 ***
## LCC         -7.320e-01  2.785e-01  -2.629 0.008891 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.447 on 415 degrees of freedom
## Multiple R-squared:  0.9426, Adjusted R-squared:  0.9407 
## F-statistic: 486.7 on 14 and 415 DF,  p-value: < 2.2e-16

Second linear regression model for D2

In this step we try to see whether we can improve our first model. Based on the t values and Pr(>|t|) values obtained in our first model, we keep the variables with Pr(>|t|) below 0.5. Although T2 has a Pr(>|t|) greater than 0.5, we include it in the second model as well, because we believe that T2 and D2 are related. As we saw in the quantile plot of the winter data, the ECMWF model T2 deviates widely from the observed T2Obs, which may explain the high Pr(>|t|) value for T2. This results in a new formula for the linear regression:

** D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC **

The summary of the model shows that the F-statistic has increased from 486.7 to 569.9. The p-value remains at the same low value, and the residual standard error is almost the same (1.447 for the first model, 1.444 for the second).

my_model2_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model2_D2)
## 
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + 
##     SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9748 -0.7604  0.0283  0.8673  5.1340 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.161e+01  9.731e+00   1.193 0.233647    
## T2          -5.119e-02  1.194e-01  -0.429 0.668300    
## D2           5.435e-01  1.128e-01   4.820 2.02e-06 ***
## T_950        3.366e-01  9.112e-02   3.694 0.000250 ***
## T2_M1       -1.212e-01  1.056e-01  -1.148 0.251666    
## T_950_M1    -1.248e-01  8.511e-02  -1.467 0.143227    
## D2_M1        3.153e-01  9.758e-02   3.231 0.001330 ** 
## SKT          7.880e-02  5.831e-02   1.352 0.177268    
## MSL         -7.151e-05  6.100e-05  -1.172 0.241729    
## TP          -7.800e-01  4.771e-01  -1.635 0.102818    
## RH_950       3.166e-02  7.157e-03   4.424 1.24e-05 ***
## V10          8.464e-02  2.286e-02   3.703 0.000242 ***
## LCC         -7.352e-01  2.775e-01  -2.649 0.008372 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.444 on 417 degrees of freedom
## Multiple R-squared:  0.9425, Adjusted R-squared:  0.9409 
## F-statistic: 569.9 on 12 and 417 DF,  p-value: < 2.2e-16

Third linear regression model for D2

We try to see if we can improve the model still further. Based on the t values and Pr(>|t|) values obtained in our second model, we keep only the variables with Pr(>|t|) around 0.1 or below. Although T2 has a higher Pr(>|t|), we again include it in the third model because we believe that T2 and D2 are related. This results in a new formula for the linear regression:

** D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC **

The summary of the model shows that the F-statistic has increased from 569.9 to 756.7. The p-value remains at the same low value, and the residual standard error is almost the same (1.444 for the second model, 1.446 for the third).

my_model3_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model3_D2)
## 
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + 
##     RH_950 + V10 + LCC, data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0105 -0.7589 -0.0108  0.8660  5.1765 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.472731   4.564740   0.542 0.588311    
## T2          -0.063511   0.055724  -1.140 0.255043    
## D2           0.564682   0.097071   5.817 1.19e-08 ***
## T_950        0.364282   0.087990   4.140 4.20e-05 ***
## T_950_M1    -0.148566   0.082889  -1.792 0.073798 .  
## D2_M1        0.265225   0.070523   3.761 0.000193 ***
## TP          -0.603928   0.455154  -1.327 0.185274    
## RH_950       0.035618   0.006496   5.483 7.22e-08 ***
## V10          0.082966   0.022751   3.647 0.000299 ***
## LCC         -0.629401   0.263309  -2.390 0.017272 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.446 on 420 degrees of freedom
## Multiple R-squared:  0.9419, Adjusted R-squared:  0.9407 
## F-statistic: 756.7 on 9 and 420 DF,  p-value: < 2.2e-16

The selected linear model for D2

From the above analysis we saw that we do not gain much in terms of residual standard error and p-value when we move from model 1 to model 3. As model 3 has nine variables with mostly good Pr(>|t|) values, we select the third model, which also has a higher F value than the other two models.

The linear regression formula for our model is taken as

** D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC **

my.model.D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my.model.D2)
## 
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + 
##     RH_950 + V10 + LCC, data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0105 -0.7589 -0.0108  0.8660  5.1765 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.472731   4.564740   0.542 0.588311    
## T2          -0.063511   0.055724  -1.140 0.255043    
## D2           0.564682   0.097071   5.817 1.19e-08 ***
## T_950        0.364282   0.087990   4.140 4.20e-05 ***
## T_950_M1    -0.148566   0.082889  -1.792 0.073798 .  
## D2_M1        0.265225   0.070523   3.761 0.000193 ***
## TP          -0.603928   0.455154  -1.327 0.185274    
## RH_950       0.035618   0.006496   5.483 7.22e-08 ***
## V10          0.082966   0.022751   3.647 0.000299 ***
## LCC         -0.629401   0.263309  -2.390 0.017272 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.446 on 420 degrees of freedom
## Multiple R-squared:  0.9419, Adjusted R-squared:  0.9407 
## F-statistic: 756.7 on 9 and 420 DF,  p-value: < 2.2e-16

Predicting D2 with my.model.D2

We now test the predicting ability of our model by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.351. Our linear regression model for D2 is

D2 = 2.472731 - 0.063511 * T2 (of the ECMWF model) + 0.564682 * D2 (of the ECMWF model) + 0.364282 * T_950 - 0.148566 * T_950_M1 + 0.265225 * D2_M1 - 0.603928 * TP + 0.035618 * RH_950 + 0.082966 * V10 - 0.629401 * LCC
# calculating the fit
  fitted.values <- predict(object = my.model.D2, newdata = D2_winter_00_08_test)
  D2pred.mymodel <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
  coefficients <- my.model.D2$coefficients
  rmse.mymodel.D2 <- sqrt(mean(residuals^2))
  coefficients 
## (Intercept)          T2          D2       T_950    T_950_M1       D2_M1 
##  2.47273050 -0.06351071  0.56468186  0.36428231 -0.14856581  0.26522469 
##          TP      RH_950         V10         LCC 
## -0.60392832  0.03561787  0.08296641 -0.62940091
  rmse.mymodel.D2
## [1] 1.351384

Comparing my.model.T2 and my.model.D2 with standard linear regression models

In order to analyze my.model.T2 and my.model.D2, we compare them with models produced by three other linear regression approaches. The first one, the “lm model”, uses the R function lm with all the variables of the dataset.

The second is the “lmstep” model, which uses the R “step” algorithm. This algorithm develops a sequence of linear models, removing or adding one variable at each step. A model is updated when it has a lower Akaike Information Criterion (AIC) than the current model. For a linear model the AIC is defined as $AIC = N \log\left(\frac{RSS}{N}\right) + 2k$, where N is the number of observations, RSS is the residual sum of squares and k is the number of degrees of freedom of the model.
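A minimal sketch of fitting such a model with R's step function, assuming the same training data frame as in the rest of the report (the exact call used for the lmstep model may differ):

```r
# Start from the full model and let step() add/remove predictors,
# keeping a change only when it lowers the AIC
lmstep.model.T2 <- step(lm(T2Obs ~ ., data = T2_winter_00_08_train),
                        direction = "both", trace = 0)
```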

The third method is glmcv, a Gaussian linear regression model with cross validation. Here the training dataset is divided into K equal parts and one part at a time is set aside as validation data. The idea behind K-fold cross validation is that the linear regression model is trained on the remaining K-1 parts, predictions are made for the part that was set aside, and the prediction error is calculated for that part. In the next iteration the previously set-aside part is returned to the training data and a new part is set aside. The iterations continue in this modular fashion, each time using K-1 parts for training and one part for cross validation. The prediction error is calculated for each iteration, and glmcv chooses the iteration with the least error and selects that model.
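The procedure described above can be sketched as an explicit K-fold loop; this is an illustrative implementation under our own naming, not the exact glmcv code, and it assumes the training data frame T2_winter_00_08_train with response T2Obs:

```r
K <- 10
n <- nrow(T2_winter_00_08_train)
folds <- sample(rep(1:K, length.out = n))   # random fold assignment
cv.rmse <- numeric(K)
for (k in 1:K) {
  train <- T2_winter_00_08_train[folds != k, ]
  valid <- T2_winter_00_08_train[folds == k, ]
  fit   <- glm(T2Obs ~ ., data = train)     # Gaussian family by default
  pred  <- predict(fit, newdata = valid)
  cv.rmse[k] <- sqrt(mean((valid$T2Obs - pred)^2))
}
cv.rmse                                     # prediction error of each fold
```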

We then use the lm, lmstep and glmcv models to predict T2 and D2 from the test data, and we use the root mean square error (RMSE) to compare the different models.
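Throughout the comparison the RMSE is computed the same way; a small convenience function (our own helper, not part of the original scripts) makes the definition explicit:

```r
# Root mean square error between observed and predicted vectors
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
```

With it, a call such as rmse(T2_winter_00_08_test[, 1], T2pred.lm) would reproduce the RMSE values quoted in the text.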

Modelling with lm for T2

For the lm model we use all the model variables. This model has a residual standard error of 1.413, an F-statistic of 228.9 and a p-value < 2.2e-16.

lm.model.T2 <- lm(T2Obs ~ ., data = T2_winter_00_08_train)
summary(lm.model.T2)
## 
## Call:
## lm(formula = T2Obs ~ ., data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2034 -0.6131  0.1082  0.7946  4.4716 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.788e+01  1.111e+01   1.609 0.108370    
## MSL         -4.625e-05  7.012e-05  -0.660 0.509912    
## T2           1.954e-01  1.218e-01   1.604 0.109479    
## D2           4.390e-01  1.151e-01   3.815 0.000158 ***
## U10          1.569e-02  2.537e-02   0.618 0.536789    
## V10          6.873e-02  2.387e-02   2.880 0.004192 ** 
## TP          -1.122e+00  5.206e-01  -2.155 0.031723 *  
## LCC         -1.438e+00  2.996e-01  -4.800 2.24e-06 ***
## MCC         -9.167e-02  2.640e-01  -0.347 0.728581    
## HCC          1.279e-02  2.031e-01   0.063 0.949846    
## SKT         -1.651e-02  5.941e-02  -0.278 0.781277    
## RH_950       1.252e-02  1.336e-02   0.938 0.349002    
## T_950        2.053e-01  1.535e-01   1.338 0.181739    
## RH_925      -1.790e-02  1.235e-02  -1.449 0.148033    
## T_925       -3.156e-01  1.474e-01  -2.141 0.032904 *  
## RH_850       1.170e-02  5.390e-03   2.171 0.030530 *  
## T_850        3.434e-02  6.016e-02   0.571 0.568491    
## RH_700      -4.971e-04  3.395e-03  -0.146 0.883646    
## T_700        1.454e-02  4.713e-02   0.308 0.757918    
## RH_500       2.578e-04  2.995e-03   0.086 0.931455    
## T_500        2.594e-02  3.145e-02   0.825 0.409956    
## T2_M1        3.035e-01  1.072e-01   2.830 0.004885 ** 
## T_950_M1     1.147e-01  1.196e-01   0.959 0.338050    
## T_925_M1     8.026e-02  1.102e-01   0.728 0.466943    
## DECLINATION  1.130e-02  1.592e-02   0.710 0.478269    
## D2_M1       -1.190e-01  9.902e-02  -1.201 0.230269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.413 on 404 degrees of freedom
## Multiple R-squared:  0.9341, Adjusted R-squared:   0.93 
## F-statistic: 228.9 on 25 and 404 DF,  p-value: < 2.2e-16

Predicting T2 with the lm.model.T2

We now assess the predictive ability of lm.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.411888.

# calculating the fit
  fitted.values <- predict(object = lm.model.T2, newdata = T2_winter_00_08_test)
  T2pred.lm <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
  coefficients <- lm.model.T2$coefficients
  rmse.lm.T2 <- sqrt(mean(residuals^2))
  coefficients 
##   (Intercept)           MSL            T2            D2           U10 
##  1.787541e+01 -4.624988e-05  1.954199e-01  4.390128e-01  1.568595e-02 
##           V10            TP           LCC           MCC           HCC 
##  6.872884e-02 -1.122150e+00 -1.437913e+00 -9.167406e-02  1.278564e-02 
##           SKT        RH_950         T_950        RH_925         T_925 
## -1.650709e-02  1.252312e-02  2.053390e-01 -1.789853e-02 -3.156258e-01 
##        RH_850         T_850        RH_700         T_700        RH_500 
##  1.170128e-02  3.433758e-02 -4.971310e-04  1.453740e-02  2.577569e-04 
##         T_500         T2_M1      T_950_M1      T_925_M1   DECLINATION 
##  2.594027e-02  3.035225e-01  1.147081e-01  8.025665e-02  1.130134e-02 
##         D2_M1 
## -1.189730e-01
  rmse.lm.T2
## [1] 1.411888

Modelling with lm for D2

For the lm model we use all the model variables. With this model we get a residual standard error of 1.438, an F-statistic of 276.2 and a p-value < 2.2e-16.

lm.model.D2 <- lm(D2Obs ~ ., data = D2_winter_00_08_train)
summary(lm.model.D2)
## 
## Call:
## lm(formula = D2Obs ~ ., data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1028 -0.7146 -0.0346  0.8332  5.1312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.370e+00  1.131e+01  -0.121 0.903687    
## MSL         -4.015e-05  7.141e-05  -0.562 0.574226    
## T2          -5.310e-02  1.241e-01  -0.428 0.668872    
## D2           5.394e-01  1.172e-01   4.603 5.59e-06 ***
## U10          3.411e-02  2.584e-02   1.320 0.187504    
## V10          6.797e-02  2.430e-02   2.797 0.005406 ** 
## TP          -5.540e-01  5.302e-01  -1.045 0.296657    
## LCC         -8.296e-01  3.051e-01  -2.719 0.006821 ** 
## MCC         -3.327e-02  2.688e-01  -0.124 0.901581    
## HCC         -8.632e-02  2.069e-01  -0.417 0.676689    
## SKT          9.765e-02  6.050e-02   1.614 0.107305    
## RH_950       4.590e-02  1.360e-02   3.374 0.000811 ***
## T_950        4.304e-01  1.563e-01   2.754 0.006157 ** 
## RH_925      -1.996e-02  1.258e-02  -1.587 0.113247    
## T_925       -1.621e-01  1.501e-01  -1.080 0.280926    
## RH_850       9.344e-03  5.489e-03   1.702 0.089452 .  
## T_850        3.326e-02  6.126e-02   0.543 0.587477    
## RH_700      -1.310e-03  3.457e-03  -0.379 0.704838    
## T_700       -9.826e-03  4.800e-02  -0.205 0.837896    
## RH_500       7.553e-04  3.050e-03   0.248 0.804524    
## T_500        2.791e-02  3.202e-02   0.872 0.383929    
## T2_M1       -8.806e-02  1.092e-01  -0.806 0.420497    
## T_950_M1    -1.349e-01  1.218e-01  -1.108 0.268599    
## T_925_M1     4.932e-02  1.122e-01   0.439 0.660583    
## DECLINATION -4.616e-02  1.621e-02  -2.847 0.004644 ** 
## D2_M1        2.821e-01  1.008e-01   2.797 0.005400 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.438 on 404 degrees of freedom
## Multiple R-squared:  0.9447, Adjusted R-squared:  0.9413 
## F-statistic: 276.2 on 25 and 404 DF,  p-value: < 2.2e-16

Predicting D2 with the lm.model.D2

We now assess the predictive ability of lm.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.305783.

# calculating the fit
  fitted.values <- predict(object = lm.model.D2, newdata = D2_winter_00_08_test)
  D2pred.lm <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
  coefficients <- lm.model.D2$coefficients
  rmse.lm.D2 <- sqrt(mean(residuals^2))
  coefficients 
##   (Intercept)           MSL            T2            D2           U10 
## -1.369675e+00 -4.015087e-05 -5.309726e-02  5.394137e-01  3.411170e-02 
##           V10            TP           LCC           MCC           HCC 
##  6.797341e-02 -5.540012e-01 -8.295798e-01 -3.326516e-02 -8.632013e-02 
##           SKT        RH_950         T_950        RH_925         T_925 
##  9.764619e-02  4.589526e-02  4.304398e-01 -1.996047e-02 -1.621084e-01 
##        RH_850         T_850        RH_700         T_700        RH_500 
##  9.344400e-03  3.326178e-02 -1.310375e-03 -9.825790e-03  7.552656e-04 
##         T_500         T2_M1      T_950_M1      T_925_M1   DECLINATION 
##  2.791360e-02 -8.806281e-02 -1.349101e-01  4.932023e-02 -4.615555e-02 
##         D2_M1 
##  2.820616e-01
  rmse.lm.D2
## [1] 1.305783

Modelling T2 with lmstep

As explained earlier, this algorithm develops a sequence of linear models, removing or adding a variable at each step. The lmstep model starts with a formula known as the object.formula. We can define lower and upper formulas, and the model can step backward or forward from the object formula towards the lower or upper formula, up to the number of steps given in the steps variable. Here we define the object.formula as a constant (no variables present), i.e., T2Obs ~ 1. The lower.formula is the same as the object.formula and the upper.formula is the formula with all the variables, i.e., T2Obs ~ . (all predictors). We use only the forward direction.

With this model we get a residual standard error of 1.397, an F-statistic of 531.2 and a p-value < 2.2e-16.

 # generating the linear models
  response <- colnames(T2_winter_00_08_train)[1]
  object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
  upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
  lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
  object.lm <- lm(object.formula, data = T2_winter_00_08_train)
  upper.lm <- lm(upper.formula, data = T2_winter_00_08_train)
  lower.lm <- lm(lower.formula, data = T2_winter_00_08_train)
  direction <- "forward"
  steps <-  1000
  # choosing the best model
  step.model.T2 <- step(object = object.lm,
                     scope = list(upper = upper.lm, lower = lower.lm),
                     direction = direction, trace = 0 , steps = steps)
  step.model.T2
## 
## Call:
## lm(formula = T2Obs ~ T2 + T_950_M1 + LCC + D2 + V10 + T2_M1 + 
##     TP + RH_850 + T_500 + T_925 + D2_M1, data = T2_winter_00_08_train)
## 
## Coefficients:
## (Intercept)           T2     T_950_M1          LCC           D2  
##    9.823552     0.194058     0.209485    -1.606888     0.462476  
##         V10        T2_M1           TP       RH_850        T_500  
##    0.071111     0.296620    -1.175379     0.006691     0.034410  
##       T_925        D2_M1  
##   -0.084216    -0.138672
  summary(step.model.T2)
## 
## Call:
## lm(formula = T2Obs ~ T2 + T_950_M1 + LCC + D2 + V10 + T2_M1 + 
##     TP + RH_850 + T_500 + T_925 + D2_M1, data = T2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1462 -0.6120  0.0862  0.8179  4.5285 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.823552   4.529546   2.169  0.03066 *  
## T2           0.194058   0.101660   1.909  0.05696 .  
## T_950_M1     0.209485   0.058579   3.576  0.00039 ***
## LCC         -1.606888   0.254824  -6.306 7.30e-10 ***
## D2           0.462476   0.098471   4.697 3.59e-06 ***
## V10          0.071111   0.022136   3.212  0.00142 ** 
## T2_M1        0.296620   0.100797   2.943  0.00343 ** 
## TP          -1.175379   0.463145  -2.538  0.01152 *  
## RH_850       0.006691   0.003119   2.145  0.03250 *  
## T_500        0.034410   0.017177   2.003  0.04580 *  
## T_925       -0.084216   0.054377  -1.549  0.12220    
## D2_M1       -0.138672   0.092717  -1.496  0.13550    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 418 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.9315 
## F-statistic: 531.2 on 11 and 418 DF,  p-value: < 2.2e-16

Predicting T2 with the lmstep model

We now assess the predictive ability of step.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.422109.

# calculating the fit
  fitted.values <- predict(object = step.model.T2, newdata = T2_winter_00_08_test)
  T2pred.lmstep <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
  coefficients <- step.model.T2$coefficients
  rmse.lmstep.T2 <- sqrt(mean(residuals^2))
  coefficients
##  (Intercept)           T2     T_950_M1          LCC           D2 
##  9.823552427  0.194057743  0.209484845 -1.606888038  0.462475765 
##          V10        T2_M1           TP       RH_850        T_500 
##  0.071110983  0.296620115 -1.175378598  0.006691133  0.034409528 
##        T_925        D2_M1 
## -0.084215592 -0.138671532
  rmse.lmstep.T2
## [1] 1.422109

Modelling D2 with the lmstep model

Similar to the T2 model, we define the object.formula as a constant (no variables present), i.e., D2Obs ~ 1. The lower.formula is the same as the object.formula and the upper.formula is the formula with all the variables, i.e., D2Obs ~ . (all predictors). We use only the forward direction.

With this model we get a residual standard error of 1.428, an F-statistic of 777.9 and a p-value < 2.2e-16. Notice that T2 is not present in the final formula, which gave the best AIC.

 # generating the linear models
  response <- colnames(D2_winter_00_08_train)[1]
  object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
  upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
  lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
  object.lm <- lm(object.formula, data = D2_winter_00_08_train)
  upper.lm <- lm(upper.formula, data = D2_winter_00_08_train)
  lower.lm <- lm(lower.formula, data = D2_winter_00_08_train)
  direction <- "forward"
  steps <-  1000
  # choosing the best model
  step.model.D2 <- step(object = object.lm,
                     scope = list(upper = upper.lm, lower = lower.lm),
                     direction = direction, trace = 0 , steps = steps)
 summary(step.model.D2) 
## 
## Call:
## lm(formula = D2Obs ~ D2 + U10 + DECLINATION + T_950 + RH_950 + 
##     LCC + V10 + D2_M1 + T_950_M1, data = D2_winter_00_08_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0791 -0.6686 -0.0301  0.8325  5.2744 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.239479   5.164038  -0.434 0.664752    
## D2           0.561928   0.082665   6.798 3.66e-11 ***
## U10          0.043221   0.023492   1.840 0.066496 .  
## DECLINATION -0.042510   0.014883  -2.856 0.004500 ** 
## T_950        0.333876   0.086727   3.850 0.000137 ***
## RH_950       0.031571   0.006411   4.924 1.22e-06 ***
## LCC         -0.632813   0.267905  -2.362 0.018628 *  
## V10          0.062783   0.022688   2.767 0.005903 ** 
## D2_M1        0.244156   0.069842   3.496 0.000523 ***
## T_950_M1    -0.142431   0.081859  -1.740 0.082600 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.428 on 420 degrees of freedom
## Multiple R-squared:  0.9434, Adjusted R-squared:  0.9422 
## F-statistic: 777.9 on 9 and 420 DF,  p-value: < 2.2e-16

Predicting D2 with the lmstep model

We now assess the predictive ability of step.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.334799.

# calculating the fit
  fitted.values <- predict(object = step.model.D2, newdata = D2_winter_00_08_test)
  D2pred.lmstep <- fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
  coefficients <- step.model.D2$coefficients
  rmse.lmstep.D2 <- sqrt(mean(residuals^2))
  coefficients
## (Intercept)          D2         U10 DECLINATION       T_950      RH_950 
## -2.23947895  0.56192768  0.04322126 -0.04250985  0.33387578  0.03157108 
##         LCC         V10       D2_M1    T_950_M1 
## -0.63281298  0.06278252  0.24415608 -0.14243140
  rmse.lmstep.D2
## [1] 1.334799

Modelling with glm function and cross validation cv.glm function

We use the glm function together with the cv.glm function from the boot package. We use K = 4 folds, so roughly 300 observations are used for training and roughly 100 for cross validation in each fold. The cross-validation error is smallest for the first iteration.

library(boot)
set.seed(123)
cv.error.10 = rep(0, 10)
for (i in 1:10) {
    glm.fit.T2 = glm(T2Obs ~ ., data = T2_winter_00_08_train)
    cv.error.10[i] = cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)$delta[1]
}
cv.error.10
##  [1] 2.143964 2.263606 2.187350 2.202544 2.285312 2.162794 2.227936
##  [8] 2.363532 2.399627 2.274742
summary(glm.fit.T2)
## 
## Call:
## glm(formula = T2Obs ~ ., data = T2_winter_00_08_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.2034  -0.6131   0.1082   0.7946   4.4716  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.788e+01  1.111e+01   1.609 0.108370    
## MSL         -4.625e-05  7.012e-05  -0.660 0.509912    
## T2           1.954e-01  1.218e-01   1.604 0.109479    
## D2           4.390e-01  1.151e-01   3.815 0.000158 ***
## U10          1.569e-02  2.537e-02   0.618 0.536789    
## V10          6.873e-02  2.387e-02   2.880 0.004192 ** 
## TP          -1.122e+00  5.206e-01  -2.155 0.031723 *  
## LCC         -1.438e+00  2.996e-01  -4.800 2.24e-06 ***
## MCC         -9.167e-02  2.640e-01  -0.347 0.728581    
## HCC          1.279e-02  2.031e-01   0.063 0.949846    
## SKT         -1.651e-02  5.941e-02  -0.278 0.781277    
## RH_950       1.252e-02  1.336e-02   0.938 0.349002    
## T_950        2.053e-01  1.535e-01   1.338 0.181739    
## RH_925      -1.790e-02  1.235e-02  -1.449 0.148033    
## T_925       -3.156e-01  1.474e-01  -2.141 0.032904 *  
## RH_850       1.170e-02  5.390e-03   2.171 0.030530 *  
## T_850        3.434e-02  6.016e-02   0.571 0.568491    
## RH_700      -4.971e-04  3.395e-03  -0.146 0.883646    
## T_700        1.454e-02  4.713e-02   0.308 0.757918    
## RH_500       2.578e-04  2.995e-03   0.086 0.931455    
## T_500        2.594e-02  3.145e-02   0.825 0.409956    
## T2_M1        3.035e-01  1.072e-01   2.830 0.004885 ** 
## T_950_M1     1.147e-01  1.196e-01   0.959 0.338050    
## T_925_M1     8.026e-02  1.102e-01   0.728 0.466943    
## DECLINATION  1.130e-02  1.592e-02   0.710 0.478269    
## D2_M1       -1.190e-01  9.902e-02  -1.201 0.230269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.99557)
## 
##     Null deviance: 12225.28  on 429  degrees of freedom
## Residual deviance:   806.21  on 404  degrees of freedom
## AIC: 1544.6
## 
## Number of Fisher Scoring iterations: 2

Predicting T2 with the glm model

We now assess the predictive ability of glm.fit.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.411888.

# calculating the fit
  fitted.values <- predict(object = glm.fit.T2, newdata = T2_winter_00_08_test)
  T2pred.glmcv <- fitted.values
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
  coefficients <- glm.fit.T2$coefficients
  rmse.glmcv.T2 <- sqrt(mean(residuals^2))
  coefficients 
##   (Intercept)           MSL            T2            D2           U10 
##  1.787541e+01 -4.624988e-05  1.954199e-01  4.390128e-01  1.568595e-02 
##           V10            TP           LCC           MCC           HCC 
##  6.872884e-02 -1.122150e+00 -1.437913e+00 -9.167406e-02  1.278564e-02 
##           SKT        RH_950         T_950        RH_925         T_925 
## -1.650709e-02  1.252312e-02  2.053390e-01 -1.789853e-02 -3.156258e-01 
##        RH_850         T_850        RH_700         T_700        RH_500 
##  1.170128e-02  3.433758e-02 -4.971310e-04  1.453740e-02  2.577569e-04 
##         T_500         T2_M1      T_950_M1      T_925_M1   DECLINATION 
##  2.594027e-02  3.035225e-01  1.147081e-01  8.025665e-02  1.130134e-02 
##         D2_M1 
## -1.189730e-01
  rmse.glmcv.T2
## [1] 1.411888

Modelling D2 with glm function and cross validation cv.glm function

Similar to the T2 modelling with glm, we carry out the modelling for D2, here with K = 5 folds.

library(boot)
set.seed(123)
cv.error.10 = rep(0, 10)
for (i in 1:10) {
    glm.fit.D2 = glm(D2Obs ~ ., data = D2_winter_00_08_train)
    cv.error.10[i] = cv.glm(D2_winter_00_08_train, glm.fit.D2, K = 5)$delta[1]
}
cv.error.10
##  [1] 2.232088 2.253692 2.297469 2.343139 2.303318 2.279148 2.321005
##  [8] 2.407905 2.353452 2.230696
summary(glm.fit.D2)
## 
## Call:
## glm(formula = D2Obs ~ ., data = D2_winter_00_08_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1028  -0.7146  -0.0346   0.8332   5.1312  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.370e+00  1.131e+01  -0.121 0.903687    
## MSL         -4.015e-05  7.141e-05  -0.562 0.574226    
## T2          -5.310e-02  1.241e-01  -0.428 0.668872    
## D2           5.394e-01  1.172e-01   4.603 5.59e-06 ***
## U10          3.411e-02  2.584e-02   1.320 0.187504    
## V10          6.797e-02  2.430e-02   2.797 0.005406 ** 
## TP          -5.540e-01  5.302e-01  -1.045 0.296657    
## LCC         -8.296e-01  3.051e-01  -2.719 0.006821 ** 
## MCC         -3.327e-02  2.688e-01  -0.124 0.901581    
## HCC         -8.632e-02  2.069e-01  -0.417 0.676689    
## SKT          9.765e-02  6.050e-02   1.614 0.107305    
## RH_950       4.590e-02  1.360e-02   3.374 0.000811 ***
## T_950        4.304e-01  1.563e-01   2.754 0.006157 ** 
## RH_925      -1.996e-02  1.258e-02  -1.587 0.113247    
## T_925       -1.621e-01  1.501e-01  -1.080 0.280926    
## RH_850       9.344e-03  5.489e-03   1.702 0.089452 .  
## T_850        3.326e-02  6.126e-02   0.543 0.587477    
## RH_700      -1.310e-03  3.457e-03  -0.379 0.704838    
## T_700       -9.826e-03  4.800e-02  -0.205 0.837896    
## RH_500       7.553e-04  3.050e-03   0.248 0.804524    
## T_500        2.791e-02  3.202e-02   0.872 0.383929    
## T2_M1       -8.806e-02  1.092e-01  -0.806 0.420497    
## T_950_M1    -1.349e-01  1.218e-01  -1.108 0.268599    
## T_925_M1     4.932e-02  1.122e-01   0.439 0.660583    
## DECLINATION -4.616e-02  1.621e-02  -2.847 0.004644 ** 
## D2_M1        2.821e-01  1.008e-01   2.797 0.005400 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.069254)
## 
##     Null deviance: 15126.62  on 429  degrees of freedom
## Residual deviance:   835.98  on 404  degrees of freedom
## AIC: 1560.2
## 
## Number of Fisher Scoring iterations: 2

Predicting D2 with the glm model

We now assess the predictive ability of glm.fit.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the root mean square error (rmse) is 1.305783.

# calculating the fit
  fitted.values <- predict(object = glm.fit.D2, newdata = D2_winter_00_08_test)
   D2pred.glmcv <- fitted.values
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
  coefficients <- glm.fit.D2$coefficients
  rmse.glmcv.D2 <- sqrt(mean(residuals^2))
  coefficients 
##   (Intercept)           MSL            T2            D2           U10 
## -1.369675e+00 -4.015087e-05 -5.309726e-02  5.394137e-01  3.411170e-02 
##           V10            TP           LCC           MCC           HCC 
##  6.797341e-02 -5.540012e-01 -8.295798e-01 -3.326516e-02 -8.632013e-02 
##           SKT        RH_950         T_950        RH_925         T_925 
##  9.764619e-02  4.589526e-02  4.304398e-01 -1.996047e-02 -1.621084e-01 
##        RH_850         T_850        RH_700         T_700        RH_500 
##  9.344400e-03  3.326178e-02 -1.310375e-03 -9.825790e-03  7.552656e-04 
##         T_500         T2_M1      T_950_M1      T_925_M1   DECLINATION 
##  2.791360e-02 -8.806281e-02 -1.349101e-01  4.932023e-02 -4.615555e-02 
##         D2_M1 
##  2.820616e-01
  rmse.glmcv.D2
## [1] 1.305783

rmse values for mymodel, lm, lmstep and glmcv predictions of T2 and D2

We can see that mymodel for T2 and D2 has rmse values comparable to those obtained using lm, lmstep and glmcv.

 cat("rmse mymodel T2=", rmse.mymodel.T2, "rmse lm T2=", rmse.lm.T2, "rmse lmstep T2=", rmse.lmstep.T2, "rmse glmcv T2=", rmse.glmcv.T2,
     "rmse mymodel D2=", rmse.mymodel.D2, "rmse lm D2=", rmse.lm.D2,"rmse lmstep D2=", rmse.lmstep.D2, "rmse glmcv D2=", rmse.glmcv.D2
 )
## rmse mymodel T2= 1.462304 rmse lm T2= 1.411888 rmse lmstep T2= 1.422109 rmse glmcv T2= 1.411888 rmse mymodel D2= 1.351384 rmse lm D2= 1.305783 rmse lmstep D2= 1.334799 rmse glmcv D2= 1.305783

The diagnostic plots

We plot the following graphs for all the models: Residuals vs Fitted values, Normal QQ-plot, Standardized residuals vs Fitted values (Scale-Location) and Residuals vs Leverage.

The Residuals vs Fitted values plot shows if the residuals have non-linear patterns. There could be a non-linear relationship between the predictor variables and the response variable, and the pattern shows up in this plot if the model is not able to capture that non-linear relationship. Ideally there should not be any pattern, and the residuals should be randomly spread around a horizontal line drawn at 0 on the y-axis. A red line in the graph helps to show the trend in the pattern and its deviation from the horizontal line at 0.

The second graph, the normal QQ-plot shows if the residuals are normally distributed. If the residuals follow the straight dashed line, the residuals are normally distributed.

The third graph, Standardized residuals vs Fitted values (Scale-Location), shows whether the spread of the residuals is roughly constant across the range of fitted values. The fourth graph, Residuals vs Leverage, helps to find influential extreme values or outliers. The data have extreme values, but we can consider these extreme values not influential if the linear regression model does not change whether we include or exclude them from the analysis. The red dotted line in the graph shows the Cook’s distance. If all the points in the leverage graph are well within the Cook’s distance, then there are no outliers that influence the linear regression.

We plot the diagnostic plots for our linear regression models, mymodel, lm, lmstep and glmcv. For mymodel, the Residuals vs Fitted values plot shows that most of the residuals lie close to the zero line. The normal QQ-plot is linear and the leverage plot does not show influential outliers. These graphs show that mymodel is a good fit for the weather data. Similar graphs for lm, lmstep and glmcv show that they also produce good linear regression models for the weather data.
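The influence check in the leverage plot can also be done numerically: cooks.distance() returns the distances drawn in that panel, and a common rule of thumb (an assumption; thresholds vary) flags observations above 4/N. A minimal sketch on R’s built-in mtcars data:

```r
# Flag potentially influential observations by Cook's distance
fit  <- lm(mpg ~ wt, data = mtcars)   # stand-in for our weather models
d    <- cooks.distance(fit)           # same quantity as in the leverage plot
flag <- which(d > 4 / nobs(fit))      # rule-of-thumb cutoff (assumption)
names(flag)                           # rows worth re-fitting without
```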

# We plot the graphs  Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

par(mfrow = c(2,2))
plot(my.model.T2, which = c(1,2,3,5), caption = list("T2-my model  Residuals vs Fitted values", "T2-mymodel Normal QQ-plot", "T2 Standardized residuals vs Fitted values", "T2-mymodel Residuals vs Leverage"))

par(mfrow = c(2,2))
plot(my.model.D2, which = c(1,2,5), caption = list("D2-my model Residuals vs Fitted values", "D2-lm Normal QQ-plot", "D2-mymodel Residuals vs Leverage"))


par(mfrow = c(2,2))

plot(lm.model.T2, which = c(1,2,5), caption = list("T2-lm  Residuals vs Fitted values", "T2-lm Normal QQ-plot", "T2-lm Residuals vs Leverage"))

par(mfrow = c(2,2))

plot(lm.model.D2, which = c(1,2,5), caption = list("D2-lm Residuals vs Fitted values", "D2-lm Normal QQ-plot", "D2-lm Residuals vs Leverage"))

par(mfrow = c(2,2))

plot(step.model.T2, which = c(1,2,5), caption = list("T2-lmstep  Residuals vs Fitted values", "T2-lmstep Normal QQ-plot", "T2-lmstep Residuals vs Leverage"))

par(mfrow = c(2,2))

plot(step.model.D2, which = c(1,2,5), caption = list("D2-lmstep Residuals vs Fitted values", "D2-lmstep Normal QQ-plot", "D2-lmstep Residuals vs Leverage"))

par(mfrow = c(2,2))

plot(glm.fit.T2, which = c(1,2,5), caption = list("T2-glm Residuals vs Fitted values", "T2-glm Normal QQ-plot", "T2-glm Residuals vs Leverage"))

par(mfrow = c(2,2))

plot(glm.fit.D2, which = c(1,2,5), caption = list("D2-glm Residuals vs Fitted values", "D2-glm Normal QQ-plot", "D2-glm Residuals vs Leverage"))

RH predictions

The aim of my assignment was to study the interdependence of the variables T2 and D2. We do this by calculating the relative humidity, which is a function of T2 and D2.

In this section, we calculate the RH values using T2Obs and D2Obs, the T2 and D2 from the ECMWF model, and the T2 and D2 obtained from the linear regression models mymodel, lm, lmstep and glmcv.

  T2Obs.test <- T2_winter_00_08_test[,1]
  D2Obs.test <- D2_winter_00_08_test[,1]
  
  
  T2model.test <- T2_winter_00_08_test[,3]
  D2model.test <- D2_winter_00_08_test[,4]
  
 # RH calculation

  # RH  from T2Obs and D2Obs
  
  saturation_humidity_obs <-  6.112*exp((17.62*T2Obs.test)/(T2Obs.test + 243.12))
  specific_humidity_obs <-  6.112*exp((17.62*D2Obs.test)/(D2Obs.test + 243.12))
  RHObs <- 100*(specific_humidity_obs/saturation_humidity_obs)
  
  # RH  from T2 and D2 from ECMWF model
  
  saturation_humidity_model <-  6.112*exp((17.62*T2model.test)/(T2model.test + 243.12))
  specific_humidity_model <-  6.112*exp((17.62*D2model.test)/(D2model.test + 243.12))
  RHmodel <- 100*(specific_humidity_model/saturation_humidity_model)
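All the RH blocks in this section repeat the same Magnus-type expression; a helper collecting it in one place (hypothetical function names, with the same constants 6.112, 17.62 and 243.12 as in the code above) would be:

```r
# Saturation vapour pressure (hPa) from temperature in degrees Celsius
svp <- function(t) 6.112 * exp(17.62 * t / (t + 243.12))

# Relative humidity (%) from 2 m temperature and 2 m dewpoint
rh <- function(t2, d2) 100 * svp(d2) / svp(t2)

stopifnot(isTRUE(all.equal(rh(10, 10), 100)))  # RH is 100% when T2 equals D2
```

As long as the dewpoint does not exceed the temperature, this RH never exceeds 100%.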

T2 and D2 predicted using mymodel, lm, lmstep and glmcv

 # T2 and D2 predicted using mymodel

  saturation_humidity_pred_mymodel <-  6.112*exp((17.62*T2pred.mymodel)/(T2pred.mymodel + 243.12))
  specific_humidity_pred_mymodel <-  6.112*exp((17.62*D2pred.mymodel)/(D2pred.mymodel + 243.12))
  RHpred.mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)

  # T2 and D2 predicted using lm
  
  saturation_humidity_pred_lm <-  6.112*exp((17.62*T2pred.lm)/(T2pred.lm + 243.12))
  specific_humidity_pred_lm <-  6.112*exp((17.62*D2pred.lm)/(D2pred.lm + 243.12))
  RHpred.lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)

  # T2 and D2 predicted using lmstep 
  
  saturation_humidity_pred_lmstep <-  6.112*exp((17.62*T2pred.lmstep)/(T2pred.lmstep + 243.12))
  specific_humidity_pred_lmstep <-  6.112*exp((17.62*D2pred.lmstep)/(D2pred.lmstep + 243.12))
  RHpred.lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)

   # T2 and D2 predicted using glm cross validation 
  
  saturation_humidity_pred_glmcv <-  6.112*exp((17.62*T2pred.glmcv)/(T2pred.glmcv + 243.12))
  specific_humidity_pred_glmcv <-  6.112*exp((17.62*D2pred.glmcv)/(D2pred.glmcv + 243.12))
  RHpred.glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)

The rmse in the RH value calculated using the T2 and D2 observations and model values

We get the following rmse values for the different models: rmse RH ECMWF model = 4.823471, rmse RH mymodel = 2.000381, rmse RH lm = 1.920225, rmse RH lmstep = 1.980861, rmse RH glmcv = 1.983662.

We can see that the ECMWF model has a high rmse in the RH value compared to all the other models. The model we developed, “mymodel”, has rmse values for RH comparable to those of the lm, lmstep and glmcv models. This experiment shows that statistical postprocessing improves the forecast, as it reduces the rmse compared to the raw model forecast.

Recall that the rmse for T2 and D2 was around 1.4 for all the linear regression models. The rmse for RH is much larger than (even double) that of T2 and D2.

  rmse.RH.model <- sqrt(mean((RHmodel - RHObs)^2))
  rmse.RH.mymodel <- sqrt(mean((RHpred.mymodel - RHObs)^2))
  rmse.RH.lm <- sqrt(mean((RHpred.lm - RHObs)^2))
  rmse.RH.lmstep <- sqrt(mean((RHpred.lmstep - RHObs)^2))
  rmse.RH.glmcv <- sqrt(mean((RHpred.glmcv - RHObs)^2))
  cat("rmse RH ECMWF model = ", rmse.RH.model, "rmse RH mymodel = ", rmse.RH.mymodel, "rmse RH lm = ", rmse.RH.lm, "rmse RH lmstep = ", rmse.RH.lmstep, "rmse RH glmcv = ", rmse.RH.glmcv)
## rmse RH ECMWF model =  4.823471 rmse RH mymodel =  2.000381 rmse RH lm =  1.920225 rmse RH lmstep =  1.980861 rmse RH glmcv =  1.983662

Linear regression models for Kumpula station

This section is divided into (1) model prediction and (2) analysis and results. We fit the linear regression models for the Kumpula station for the four seasons and for the 64 forecast periods. For ease of analysis we use only the forecasts that are at 00:00 UTC (analysis_time is 1). We use simple linear regression (lm), linear regression with the step method and linear regression with cross validation (glm.cv).

The linear regression models take two response variables, namely T2Obs and D2Obs, the observations of air temperature and dewpoint temperature at 2 meters. As the predictors we use the ECMWF model outputs. All these variables are explained in the section on weather data. Using the linear regression models mymodel, lm, lmstep and glmcv, we predict T2 and D2, the air temperature and dewpoint temperature, for the 64 forecast periods. We then calculate the relative humidity, RH, from the T2 and D2 predicted by the four linear regression models. We also calculate the RH using the T2 and D2 from the raw ECMWF model. The relative humidity is calculated using the standard Magnus-type formula.

The R program is structured in a modular way with functions.

Linear model prediction

# Packages 
# reshape package for melt operation 
library(reshape2)

# glm and glm.cv
library(boot)

FitWithmymodel <- function(training.set, test.set = NULL, parameter) {
  # Estimates a  linear model using "mymodel" we developed and predicts with the model.
  #
  # Args:
  #   training.set:
  #   test.set:
  #   parameter shows whether we want to model for T2 or D2.
  #   introducing the parameter is a simple way to select the formula that is to be 
  #   used for T2 and D2
  #
  # Returns:
  #   A list of coefficients, residuals and fitted values.
  
  if (parameter == 1) { 
    model <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC, data = training.set)
  } else if (parameter == 2) {
    model <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = training.set)
  } else {
    stop("parameter must be 1 (for T2) or 2 (for D2)")
  }
  
  
  if (is.null(test.set)) {
    test.set <- training.set
  }
  
  fitted.values <- predict(object = model, newdata = test.set)
  residuals <- test.set[,1] - fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(residuals)
  
  results <- list("coefficients" = model$coefficients,
                  "residuals" = residuals,
                  "fitted.values" = fitted.values)
  
  class(results) <- "simplelm"
  
  return(results)
}

FitWithLM <- function(training.set, test.set = NULL) {
  # Estimates a full linear model and predicts with the model.
  #
  # Args:
  #   training.set: a data frame with the response in the first column
  #   test.set: a data frame with the same columns; defaults to the training set
  #
  # Returns:
  #   A list of coefficients, residuals and fitted values.
  
  response <- colnames(training.set)[1]
  formula <- paste(response, " ~ .", sep = "")
  model <- lm(formula = formula, data = training.set)
  
  if (is.null(test.set)) {
    test.set <- training.set
  }
  
  fitted.values <- predict(object = model, newdata = test.set)
  residuals <- test.set[,1] - fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(residuals)
  
  results <- list("coefficients" = model$coefficients,
                  "residuals" = residuals,
                  "fitted.values" = fitted.values)
  
  class(results) <- "simplelm"
  
  return(results)
}

FitWithStep <- function(training.set, test.set = NULL,
                        object.formula, upper.formula, lower.formula,
                        direction, steps = 1000) {
  
  
  # Chooses a linear model by a stepwise algorithm and predicts a fit on an independent test data set
  #
  # Args:
  #   training.set: a data matrix
  #   test.set: a data matrix
  #   object.formula: a formula viable for a linear model
  #   upper.formula: a formula viable for a linear model
  #   lower.formula: a formula viable for a linear model
  #   direction: both/forward/backward
  #   steps: the number of steps the algorithm is allowed to take
  # 
  # Returns:
  #   A list of coefficients, residuals and fitted values.
  
  # generating the linear models
  object.lm <- lm(object.formula, data = training.set)
  upper.lm <- lm(upper.formula, data = training.set)
  lower.lm <- lm(lower.formula, data = training.set)
  
  # choosing the best model
  step.model <- step(object = object.lm,
                     scope = list(upper = upper.lm, lower = lower.lm),
                     direction = direction, trace = 0 , steps = steps)
  
  # if there are no test data, the fit is calculated using the training data
  if (is.null(test.set)) {
    test.set <- training.set
  }
  
  # calculating the fit
  fitted.values <- predict(object = step.model, newdata = test.set)
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(test.set[,1] - fitted.values)
  coefficients <- step.model$coefficients
  
  results <- list("coefficients" = coefficients,
                  "residuals" = residuals,
                  "fitted.values" = fitted.values)
  
  class(results) <- "simplelm"
  return(results)
}


FitWithglmcv <- function(training.set, test.set = NULL, k) {
  # Estimates a full glm linear model, reports its k-fold cross-validation
  # error and predicts with the model.
  #
  # Args:
  #   training.set: a data frame with the response in the first column
  #   test.set: a data frame with the same columns; defaults to the training set
  #   k: the number of cross-validation folds
  #
  # Returns:
  #   A list of coefficients, residuals and fitted values.
  
  response <- colnames(training.set)[1]
  formula <- paste(response, " ~ .", sep = "")
  model <- glm(formula = formula, data = training.set)
  
  # k-fold cross-validation estimate of the prediction error (cv.glm from boot);
  # the estimate is informative only and does not change the fitted model
  set.seed(123)
  cv.error <- cv.glm(training.set, model, K = k)$delta[1]

    
  if (is.null(test.set)) {
    test.set <- training.set
  }
  
  fitted.values <- predict(object = model, newdata = test.set)
  residuals <- test.set[,1] - fitted.values
  
  fitted.values <- as.numeric(fitted.values)
  residuals <- as.numeric(residuals)
  
  results <- list("coefficients" = model$coefficients,
                  "residuals" = residuals,
                  "fitted.values" = fitted.values)
  
  class(results) <- "simplelm"
  
  return(results)
}


ModelsOfAlgorithms <- function(training.set, test.set, parameter) {
  # Fits models using different algorithms, currently mymodel, lm, lmstep and glmcv,
  # and predicts on the test set.
  #
  # Args:
  #   training.set: a data frame with the response in the first column
  #   test.set: a data frame with the same columns
  #   parameter: selects the mymodel formula, 1 for T2 and 2 for D2
  #
  # Returns:
  #   A list with elements 'fitted.values', 'residuals' and 'elapsed.time'
  
  # model details 
  response <- colnames(training.set)[1]
  fInitial <- as.formula(paste(response, " ~ 1", sep = ""))
  fFull <- as.formula(paste(response, " ~ .", sep = ""))
  
  # Create a list to store the model results for T2 and D2 observations
  fitted.values.by.algorithm <- as.data.frame(matrix(NA, ncol = 0, 
                                                        nrow=nrow(test.set)))
  residuals.by.algorithm <- as.data.frame(matrix(NA, ncol = 0,
                                                    nrow = nrow(test.set)))
 
  elapsed.time.by.algorithm <- NULL
 
   # mymodel
    tmp.time.result <- system.time(
    tmp.result <- FitWithmymodel(
                    training.set, test.set, parameter
                  )
  )
  
  fitted.values.by.algorithm$mymodel <- tmp.result$fitted.values
  residuals.by.algorithm$mymodel  <- tmp.result$residuals
  elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
                                mymodel = tmp.time.result[[3]])
  
  
  
  # lm 
  tmp.time.result <- system.time(
    tmp.result <- FitWithLM(
                    training.set, test.set
                  )
  )
  
  fitted.values.by.algorithm$lm <- tmp.result$fitted.values
  residuals.by.algorithm$lm <- tmp.result$residuals
  elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
                                lm = tmp.time.result[[3]])
  
  
  
  # lm step model 
  
  tmp.time.result <- system.time(
    tmp.result <- FitWithStep(
                    training.set, 
                    test.set, 
                    fInitial, fFull, fInitial, 
                    direction = "forward", steps = 1000)
  )
  
  fitted.values.by.algorithm$lmstep <- tmp.result$fitted.values
  residuals.by.algorithm$lmstep <- tmp.result$residuals
  elapsed.time.by.algorithm<- c(elapsed.time.by.algorithm,
                                    lmstep =tmp.time.result[[3]])
  
 
  # glm cv model
   
  tmp.time.result <- system.time(
    tmp.result <- FitWithglmcv(
      training.set, 
      test.set, 
      k = 4
      )
  )
  
  fitted.values.by.algorithm$glmcv <- tmp.result$fitted.values
  residuals.by.algorithm$glmcv <- tmp.result$residuals
  elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
                                 glmcv = tmp.time.result[[3]])
  
   results  <-  list(fitted.values = fitted.values.by.algorithm,
                      residuals = residuals.by.algorithm,
                      elapsed.time = elapsed.time.by.algorithm)
  
  
  return(results)
}


  
CalculateRH <- function(wdata) {
  # Calculates the relative humidity RH using various fitting algorithms
  #
  # Args:
  #   wdata: a data frame of observations and model data for one season,
  #          analysis time and forecast period
  #
  # Returns:
  #   rmse.pred.by.model: rmse of the T2 and D2 predictions from the raw model, mymodel, lm, lmstep and glmcv
  #   rmse.RH.by.model: rmse of RH from the ECMWF raw model, mymodel, lm, lmstep and glmcv
  
   # Create a list to store the rmse values for predictions and RH using the models values, 
   # predicted values using lm, lmstep, glmcv
  rmse.pred.by.model <- NULL
  rmse.RH.by.model <- NULL
 
  # Separating T2 and D2 and generating T2_Obs and model data as well as D2_Obs and model data
  
  T2_Obs_model_data <- wdata[, c(6:(ncol(wdata)-1))]
  D2_Obs_model_data <- cbind(wdata[ncol(wdata)], wdata[,7:(ncol(wdata)-1)])

  
  
   # Splitting into train and test data
  n <- nrow(T2_Obs_model_data)
  train = sample(1:n, size = round(0.75*n), replace=FALSE)
  T2_Obs_model_data_train <- T2_Obs_model_data[train,]
  T2_Obs_model_data_test <-  T2_Obs_model_data[-train,]
  D2_Obs_model_data_train <- D2_Obs_model_data[train,]
  D2_Obs_model_data_test <-  D2_Obs_model_data[-train,]
  
   
  # Models and predictions for T2 and D2. Parameter value selects the formula for mymodel
  
  results.T2 <- ModelsOfAlgorithms(T2_Obs_model_data_train, T2_Obs_model_data_test, parameter = 1)
  results.D2 <- ModelsOfAlgorithms(D2_Obs_model_data_train, D2_Obs_model_data_test, parameter = 2)
  
  
  # T2 and D2 observation and model data  
  T2_obs_test <- T2_Obs_model_data_test[,1]
  D2_obs_test <- D2_Obs_model_data_test[,1]
  
  T2_model_test <- T2_Obs_model_data_test[,3]
  D2_model_test <- D2_Obs_model_data_test[,4]
  
  
  # T2 and D2 from predictions using mymodel, lm, lmstep and glmcv
  
  
  T2_pred_mymodel <- results.T2$fitted.values$mymodel
  T2_pred_lm <- results.T2$fitted.values$lm
  T2_pred_lmstep <- results.T2$fitted.values$lmstep
  T2_pred_glmcv<- results.T2$fitted.values$glmcv

  D2_pred_mymodel <- results.D2$fitted.values$mymodel
  D2_pred_lm <- results.D2$fitted.values$lm
  D2_pred_lmstep <- results.D2$fitted.values$lmstep
  D2_pred_glmcv<- results.D2$fitted.values$glmcv

  # rmse for T2 and  D2 obtained using the rawmodel, mymodel, lm, lmstep and glmcv  models
  
  rmse.T2.raw <- sqrt(mean((T2_obs_test - T2_model_test)^2))
  rmse.T2.mymodel <- sqrt(mean(results.T2$residuals$mymodel^2))
  rmse.T2.lm <- sqrt(mean(results.T2$residuals$lm^2))
  rmse.T2.lmstep <- sqrt(mean(results.T2$residuals$lmstep^2))
  rmse.T2.glmcv <- sqrt(mean(results.T2$residuals$glmcv^2))
 
  rmse.D2.raw <- sqrt(mean((D2_obs_test - D2_model_test)^2))
  rmse.D2.mymodel <- sqrt(mean(results.D2$residuals$mymodel^2))
  rmse.D2.lm <- sqrt(mean(results.D2$residuals$lm^2))
  rmse.D2.lmstep <- sqrt(mean(results.D2$residuals$lmstep^2))
  rmse.D2.glmcv <- sqrt(mean(results.D2$residuals$glmcv^2))

   rmse.pred.by.model<- c(rmse.pred.by.model,
                        T2raw = rmse.T2.raw,
                        T2mymodel = rmse.T2.mymodel,
                        T2lm = rmse.T2.lm,
                        T2lmstep = rmse.T2.lmstep,
                        T2glmcv = rmse.T2.glmcv,
                        D2raw = rmse.D2.raw,
                        D2mymodel = rmse.D2.mymodel,
                        D2lm = rmse.D2.lm,
                        D2lmstep = rmse.D2.lmstep,
                        D2glmcv = rmse.D2.glmcv
                    )
   
   
   
  # Relative humidity (RH) calculation with the Magnus approximation:
  # RH = 100 * e(D2) / e_s(T2), where e is the vapour pressure at the dewpoint
  # and e_s the saturation vapour pressure at the air temperature
   
  saturation_humidity_obs <-  6.112*exp((17.62*T2_obs_test)/(T2_obs_test + 243.12))
  specific_humidity_obs <-  6.112*exp((17.62*D2_obs_test)/(D2_obs_test + 243.12))
  RH_obs <- 100*(specific_humidity_obs/saturation_humidity_obs)
 
  saturation_humidity_model <-  6.112*exp((17.62*T2_model_test)/(T2_model_test + 243.12))
  specific_humidity_model <-  6.112*exp((17.62*D2_model_test)/(D2_model_test + 243.12))
  RH_model <- 100*(specific_humidity_model/saturation_humidity_model)
  
  # predicted T2, D2 using mymodel, lm, lmstep and glmcv
  
  saturation_humidity_pred_mymodel <-  6.112*exp((17.62*T2_pred_mymodel)/(T2_pred_mymodel + 243.12))
  specific_humidity_pred_mymodel <-  6.112*exp((17.62*D2_pred_mymodel)/(D2_pred_mymodel + 243.12))
  RH_pred_mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)
  
  saturation_humidity_pred_lm <-  6.112*exp((17.62*T2_pred_lm)/(T2_pred_lm + 243.12))
  specific_humidity_pred_lm <-  6.112*exp((17.62*D2_pred_lm)/(D2_pred_lm + 243.12))
  RH_pred_lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)
  
  saturation_humidity_pred_lmstep <-  6.112*exp((17.62*T2_pred_lmstep)/(T2_pred_lmstep + 243.12))
  specific_humidity_pred_lmstep <-  6.112*exp((17.62*D2_pred_lmstep)/(D2_pred_lmstep + 243.12))
  RH_pred_lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)
  
  
  saturation_humidity_pred_glmcv <-  6.112*exp((17.62*T2_pred_glmcv)/(T2_pred_glmcv + 243.12))
  specific_humidity_pred_glmcv <-  6.112*exp((17.62*D2_pred_glmcv)/(D2_pred_glmcv + 243.12))
  RH_pred_glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)
  
  
  # rmse for RH
  rmse_RH_model <- sqrt(mean((RH_model - RH_obs)^2))
  rmse_RH_mymodel <- sqrt(mean((RH_pred_mymodel - RH_obs)^2))
  rmse_RH_lm <- sqrt(mean((RH_pred_lm - RH_obs)^2))
  rmse_RH_lmstep <- sqrt(mean((RH_pred_lmstep - RH_obs)^2))
  rmse_RH_glmcv <- sqrt(mean((RH_pred_glmcv - RH_obs)^2))
 
  
  rmse.RH.by.model<- c(rmse.RH.by.model,
                       rawmodelRH = rmse_RH_model,
                       mymodelRH = rmse_RH_mymodel,
                       lmRH = rmse_RH_lm,
                       lmstepRH = rmse_RH_lmstep,
                       glmcvRH = rmse_RH_glmcv
   )
  
  
  #cat("RH model rmse=", rmse_RH_model, "RH lm rmse=", rmse_RH_lm, 
  #     "RH lmstep rmse=", rmse_RH_lmstep,  "RH glmcv rmse=", rmse_RH_glmcv)
 
  results <- list(rmse.pred.by.model = rmse.pred.by.model,
                  rmse.RH.by.model = rmse.RH.by.model)
 return(results)
  
  
}


### The Main Program 
# Find the mymodel, lm, lmstep and glmcv linear regression models for the Kumpula station data. 
# We predict T2 and D2, the air temperature and dewpoint temperature at 2 meters, 
# using the above models. We calculate the relative humidity (RH) with the predicted T2 and D2.
# We compare the T2, D2 and RH of the ECMWF model with the predicted T2, D2 and RH.


# Initializing variables and data structures
station_id <- 2998
analysis_time <- 1 


rmse.T2D2 <- NULL
rmse.RH <- NULL
stn.season.atime.fperiod <- NULL
rmse.pred.station.season.atime.fperiod  <- NULL
rmse.rh.station.season.atime.fperiod  <- NULL

# loading the Kumpula data from the file data/station2998_T2_D2_Obs_model_data_with_timestamps.csv.
# This data is created using our program "create_weather_data.R"

T2_D2_Obs_model_data_with_timestamps = read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)


# We separately model the data for the four seasons and the 64 forecast periods (forecast_period 2:65)

for (s  in 1:4) {
  for(f in 2:65) { 

    df <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$season == s &
                                                      T2_D2_Obs_model_data_with_timestamps$analysis_time == analysis_time & 
                                                           T2_D2_Obs_model_data_with_timestamps$forecast_period == f ),]
    
# The function CalculateRH calculates the relative humidity from the T2 and D2 predicted by the linear models.
    
    results <- CalculateRH(df)
    rmse.T2D2 <- rbind(rmse.T2D2, results$rmse.pred.by.model)
    rmse.RH <- rbind(rmse.RH, results$rmse.RH.by.model)
              
    stn.season.atime.fperiod <-  c(station = station_id,
                                   season = s,
                                   atime = analysis_time,
                                   fperiod = f
                                   )
     
      
      
     
      rmse.pred.station.season.atime.fperiod <- rbind(rmse.pred.station.season.atime.fperiod, c(stn.season.atime.fperiod, results$rmse.pred.by.model))                 
      rmse.rh.station.season.atime.fperiod <- rbind(rmse.rh.station.season.atime.fperiod, c(stn.season.atime.fperiod, results$rmse.RH.by.model))
              
    
  }
}

Results

Quantiles of T2 and D2

df.rmse.pred <- as.data.frame(rmse.pred.station.season.atime.fperiod)




for (s in 1:4) { 
    dfname <- paste("df_",s,sep ="")
    assign(dfname, df.rmse.pred[which(df.rmse.pred$season == s),])
}

# Winter 
summary(df_1[,5:14])
##      T2raw         T2mymodel          T2lm          T2lmstep    
##  Min.   :1.314   Min.   :1.167   Min.   :1.102   Min.   :1.126  
##  1st Qu.:1.887   1st Qu.:1.597   1st Qu.:1.612   1st Qu.:1.610  
##  Median :2.289   Median :2.028   Median :2.065   Median :2.004  
##  Mean   :2.654   Mean   :2.360   Mean   :2.377   Mean   :2.357  
##  3rd Qu.:3.415   3rd Qu.:3.080   3rd Qu.:3.121   3rd Qu.:3.090  
##  Max.   :4.953   Max.   :4.366   Max.   :4.377   Max.   :4.361  
##     T2glmcv          D2raw         D2mymodel          D2lm      
##  Min.   :1.102   Min.   :1.294   Min.   :1.090   Min.   :1.109  
##  1st Qu.:1.612   1st Qu.:1.957   1st Qu.:1.824   1st Qu.:1.807  
##  Median :2.065   Median :2.449   Median :2.357   Median :2.410  
##  Mean   :2.377   Mean   :2.914   Mean   :2.722   Mean   :2.772  
##  3rd Qu.:3.121   3rd Qu.:3.924   3rd Qu.:3.744   3rd Qu.:3.796  
##  Max.   :4.377   Max.   :5.660   Max.   :4.977   Max.   :5.118  
##     D2lmstep        D2glmcv     
##  Min.   :1.095   Min.   :1.109  
##  1st Qu.:1.846   1st Qu.:1.807  
##  Median :2.409   Median :2.410  
##  Mean   :2.750   Mean   :2.772  
##  3rd Qu.:3.795   3rd Qu.:3.796  
##  Max.   :5.079   Max.   :5.118
# Spring
summary(df_2[,5:14])
##      T2raw         T2mymodel          T2lm          T2lmstep     
##  Min.   :1.384   Min.   :1.040   Min.   :1.014   Min.   :0.9916  
##  1st Qu.:2.116   1st Qu.:1.466   1st Qu.:1.482   1st Qu.:1.4602  
##  Median :2.446   Median :1.932   Median :1.946   Median :1.9413  
##  Mean   :2.643   Mean   :2.083   Mean   :2.079   Mean   :2.0663  
##  3rd Qu.:3.067   3rd Qu.:2.516   3rd Qu.:2.514   3rd Qu.:2.5251  
##  Max.   :4.215   Max.   :3.937   Max.   :3.725   Max.   :3.6880  
##     T2glmcv          D2raw         D2mymodel          D2lm      
##  Min.   :1.014   Min.   :1.408   Min.   :1.164   Min.   :1.192  
##  1st Qu.:1.482   1st Qu.:2.150   1st Qu.:1.870   1st Qu.:1.917  
##  Median :1.946   Median :2.639   Median :2.502   Median :2.550  
##  Mean   :2.079   Mean   :2.867   Mean   :2.667   Mean   :2.700  
##  3rd Qu.:2.514   3rd Qu.:3.475   3rd Qu.:3.344   3rd Qu.:3.374  
##  Max.   :3.725   Max.   :4.987   Max.   :4.621   Max.   :4.494  
##     D2lmstep        D2glmcv     
##  Min.   :1.196   Min.   :1.192  
##  1st Qu.:1.944   1st Qu.:1.917  
##  Median :2.529   Median :2.550  
##  Mean   :2.684   Mean   :2.700  
##  3rd Qu.:3.355   3rd Qu.:3.374  
##  Max.   :4.463   Max.   :4.494
# Summer
summary(df_3[,5:14])
##      T2raw         T2mymodel          T2lm          T2lmstep     
##  Min.   :1.287   Min.   :1.065   Min.   :1.002   Min.   :0.9936  
##  1st Qu.:1.805   1st Qu.:1.395   1st Qu.:1.370   1st Qu.:1.3366  
##  Median :2.237   Median :1.933   Median :1.838   Median :1.8141  
##  Mean   :2.327   Mean   :1.996   Mean   :1.935   Mean   :1.9286  
##  3rd Qu.:2.794   3rd Qu.:2.515   3rd Qu.:2.475   3rd Qu.:2.4486  
##  Max.   :3.912   Max.   :3.491   Max.   :3.325   Max.   :3.3101  
##     T2glmcv          D2raw         D2mymodel           D2lm       
##  Min.   :1.002   Min.   :1.048   Min.   :0.8599   Min.   :0.8654  
##  1st Qu.:1.370   1st Qu.:1.738   1st Qu.:1.5890   1st Qu.:1.6299  
##  Median :1.838   Median :2.260   Median :2.1856   Median :2.1852  
##  Mean   :1.935   Mean   :2.508   Mean   :2.2926   Mean   :2.3053  
##  3rd Qu.:2.475   3rd Qu.:3.150   3rd Qu.:2.9152   3rd Qu.:2.9241  
##  Max.   :3.325   Max.   :5.043   Max.   :4.1308   Max.   :4.0558  
##     D2lmstep         D2glmcv      
##  Min.   :0.8583   Min.   :0.8654  
##  1st Qu.:1.6431   1st Qu.:1.6299  
##  Median :2.1973   Median :2.1852  
##  Mean   :2.2922   Mean   :2.3053  
##  3rd Qu.:2.8709   3rd Qu.:2.9241  
##  Max.   :4.0577   Max.   :4.0558
# Autumn
summary(df_4[,5:14])
##      T2raw         T2mymodel          T2lm          T2lmstep     
##  Min.   :1.169   Min.   :1.066   Min.   :1.018   Min.   :0.9544  
##  1st Qu.:1.522   1st Qu.:1.385   1st Qu.:1.389   1st Qu.:1.3824  
##  Median :1.990   Median :1.924   Median :1.946   Median :1.9283  
##  Mean   :2.180   Mean   :2.044   Mean   :2.053   Mean   :2.0400  
##  3rd Qu.:2.648   3rd Qu.:2.581   3rd Qu.:2.488   3rd Qu.:2.5093  
##  Max.   :4.642   Max.   :4.021   Max.   :4.015   Max.   :3.9945  
##     T2glmcv          D2raw          D2mymodel           D2lm       
##  Min.   :1.018   Min.   :0.9917   Min.   :0.8148   Min.   :0.8798  
##  1st Qu.:1.389   1st Qu.:1.5181   1st Qu.:1.4796   1st Qu.:1.5017  
##  Median :1.946   Median :2.3790   Median :2.1101   Median :2.1440  
##  Mean   :2.053   Mean   :2.5061   Mean   :2.2894   Mean   :2.3138  
##  3rd Qu.:2.488   3rd Qu.:3.2584   3rd Qu.:2.8153   3rd Qu.:2.8787  
##  Max.   :4.015   Max.   :5.4251   Max.   :4.4240   Max.   :4.4736  
##     D2lmstep         D2glmcv      
##  Min.   :0.8224   Min.   :0.8798  
##  1st Qu.:1.4863   1st Qu.:1.5017  
##  Median :2.1379   Median :2.1440  
##  Mean   :2.2904   Mean   :2.3138  
##  3rd Qu.:2.8396   3rd Qu.:2.8787  
##  Max.   :4.5307   Max.   :4.4736
df.rmse.rh <-  as.data.frame(rmse.rh.station.season.atime.fperiod)
for (s in 1:4) { 
    dfname <- paste("df_",s,sep ="")
    assign(dfname, df.rmse.rh[which(df.rmse.rh$season == s),])
}

# Winter 
summary(df_1[,5:9])
##    rawmodelRH       mymodelRH          lmRH          lmstepRH    
##  Min.   : 3.110   Min.   :1.373   Min.   :1.356   Min.   :1.373  
##  1st Qu.: 5.404   1st Qu.:2.318   1st Qu.:2.326   1st Qu.:2.322  
##  Median : 6.683   Median :2.634   Median :2.578   Median :2.609  
##  Mean   : 6.737   Mean   :2.790   Mean   :2.723   Mean   :2.724  
##  3rd Qu.: 7.722   3rd Qu.:3.227   3rd Qu.:3.137   3rd Qu.:3.111  
##  Max.   :10.804   Max.   :4.802   Max.   :4.571   Max.   :4.689  
##     glmcvRH     
##  Min.   :1.339  
##  1st Qu.:2.207  
##  Median :2.475  
##  Mean   :2.656  
##  3rd Qu.:3.065  
##  Max.   :4.428
# Spring
summary(df_2[,5:9])
##    rawmodelRH       mymodelRH          lmRH          lmstepRH    
##  Min.   : 5.093   Min.   :1.747   Min.   :1.687   Min.   :1.673  
##  1st Qu.: 7.816   1st Qu.:2.916   1st Qu.:2.863   1st Qu.:2.849  
##  Median :10.135   Median :3.894   Median :3.784   Median :3.812  
##  Mean   :10.141   Mean   :3.969   Mean   :3.919   Mean   :3.913  
##  3rd Qu.:11.898   3rd Qu.:4.792   3rd Qu.:4.732   3rd Qu.:4.733  
##  Max.   :17.399   Max.   :7.019   Max.   :6.879   Max.   :6.845  
##     glmcvRH     
##  Min.   :1.655  
##  1st Qu.:2.826  
##  Median :3.768  
##  Mean   :3.905  
##  3rd Qu.:4.672  
##  Max.   :6.921
# Summer
summary(df_3[,5:9])
##    rawmodelRH       mymodelRH          lmRH          lmstepRH    
##  Min.   : 4.069   Min.   :1.537   Min.   :1.452   Min.   :1.449  
##  1st Qu.: 7.496   1st Qu.:2.710   1st Qu.:2.738   1st Qu.:2.739  
##  Median : 9.545   Median :3.383   Median :3.370   Median :3.334  
##  Mean   : 9.510   Mean   :3.649   Mean   :3.624   Mean   :3.605  
##  3rd Qu.:11.905   3rd Qu.:4.693   3rd Qu.:4.607   3rd Qu.:4.624  
##  Max.   :15.656   Max.   :5.683   Max.   :5.852   Max.   :5.766  
##     glmcvRH     
##  Min.   :1.432  
##  1st Qu.:2.706  
##  Median :3.311  
##  Mean   :3.623  
##  3rd Qu.:4.575  
##  Max.   :5.837
# Autumn
summary(df_4[,5:9])
##    rawmodelRH       mymodelRH          lmRH          lmstepRH    
##  Min.   : 3.115   Min.   :1.377   Min.   :1.253   Min.   :1.249  
##  1st Qu.: 4.545   1st Qu.:1.853   1st Qu.:1.814   1st Qu.:1.816  
##  Median : 5.852   Median :2.250   Median :2.206   Median :2.234  
##  Mean   : 5.831   Mean   :2.365   Mean   :2.334   Mean   :2.311  
##  3rd Qu.: 6.798   3rd Qu.:2.727   3rd Qu.:2.635   3rd Qu.:2.608  
##  Max.   :10.256   Max.   :4.360   Max.   :4.244   Max.   :4.098  
##     glmcvRH     
##  Min.   :1.235  
##  1st Qu.:1.821  
##  Median :2.196  
##  Mean   :2.289  
##  3rd Qu.:2.672  
##  Max.   :4.149

Plots : rmse in predictions of T2 and D2 for Winter data

We plot the rmse of the T2 and D2 predictions, using the T2 and D2 obtained from the ECMWF model and from the linear models mymodel, lm, lmstep and glmcv, for the winter data. The plots show that the rmse in T2 and D2 is smaller for our linear prediction models than for the T2 and D2 from the ECMWF model. The mymodel predictions are comparable to those of the other linear regression models.

require(ggplot2)
require(reshape)

df.rmse.pred <- as.data.frame(rmse.pred.station.season.atime.fperiod)
df_winter_pred_model <- df.rmse.pred[which(df.rmse.pred$season == 1),]


df_winter_T2_pred_model <- df_winter_pred_model[, c(4:9)]
df_winter_D2_pred_model <- df_winter_pred_model[, c(4,10:14)]
fperiod <- df_winter_T2_pred_model$fperiod


molten <- melt(df_winter_T2_pred_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x=fperiod, y=value, colour = model )) + geom_line()+xlab("forecast period") + ylab("RMSE") + ggtitle("RMSE in T2 prediction with  linear regression models for the winter season")

molten <- melt(df_winter_D2_pred_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x=fperiod, y=value, colour = model )) + geom_line()+xlab("forecast period") + ylab("RMSE") + ggtitle("RMSE in D2 prediction with  linear regression models for the winter season")

Plots : rmse in RH for the Winter data

We plot the rmse in RH, which depends on T2 and D2. We plot the RH based on the T2 and D2 obtained from the ECMWF model and from the linear models mymodel, lm, lmstep and glmcv. We see that the errors in T2 and D2 are amplified in RH. The plots show that our linear prediction models reduce the rmse by up to a factor of 2.

df.rmse.rh <-  as.data.frame(rmse.rh.station.season.atime.fperiod)
df_winter_rh_model <- df.rmse.rh[which(df.rmse.rh$season == 1),]

df_winter_rh_model <- df_winter_rh_model[, c(4:9)]

molten <- melt(df_winter_rh_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x=fperiod, y=value, colour = model )) + geom_line()+xlab("forecast period") + ylab("RMSE") + ggtitle("RMSE in RH with  linear regression models for the winter season")

RMSE T2 in all seasons

The three plots below give the quantile values of the prediction errors in T2, D2 and RH for all seasons. We can see that mymodel has the lowest quantile values for T2 and similar values for D2. The median RH error for the ECMWF model is almost double that obtained from the linear models.

df.rmse.pred.station.fperiod <- df.rmse.pred[,c(2, 4:9)]

molten <- melt(df.rmse.pred.station.fperiod , id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x= season, y=value, colour = model )) + geom_boxplot()+xlab("seasons(1:4)") + ylab("RMSE") + ggtitle("RMSE in T2 with  linear regression models for four seasons")

RMSE D2 in all seasons

df.rmse.pred.station.fperiod <- df.rmse.pred[,c(2, 4,10:14)]

molten <- melt(df.rmse.pred.station.fperiod , id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x= season, y=value, colour = model )) + geom_boxplot()+xlab("seasons(1:4)") + ylab("RMSE") + ggtitle("RMSE in D2 with  linear regression models for four seasons")

RMSE RH in all seasons

df.rmse.rh.station.fperiod <- df.rmse.rh[,c(2, 4:9)]

molten <- melt(df.rmse.rh.station.fperiod , id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x= season, y=value, colour = model )) + geom_boxplot()+xlab("seasons(1:4)") + ylab("RMSE") + ggtitle("RMSE in RH with  linear regression models for four seasons")

Conclusions and future directions

Statistical postprocessing is a method to improve the direct model output from numerical weather prediction (NWP) models. In this assignment we use linear regression models that relate the observed air temperature at 2 meters (T2Obs) and the dewpoint temperature at 2 meters (D2Obs) to the NWP model output in order to improve its forecast capability. We develop a linear regression model, called mymodel, for the air temperature at 2 meters (T2) and the dewpoint temperature at 2 meters (D2). We compare our model with the standard models available in R such as lm, step and cv.glm. We then use the predicted T2 and D2 to calculate the relative humidity and compute the root mean square error (rmse) of RH for the ECMWF NWP model and for the linear models. The results show that the linear regression models reduce the rmse in RH by up to a factor of 2. The data used for the analysis are the weather data for the Kumpula weather station for the four seasons and 64 forecast periods, produced by ECMWF, together with the T2Obs and D2Obs available at the Finnish Meteorological Institute.

Our study shows that statistical postprocessing improves the prediction capability of a model. We observed that the prediction errors in a variable are amplified by the prediction errors in the variables it depends on. In this study we use only linear regression models. In the future we would like to tackle this problem of interdependence with principal component analysis (PCA) and principal component regression (PCR) to predict the variables and their dependent variables.
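As a pointer for that future work, PCR can be prototyped with base R alone: project the standardized predictors onto their leading principal components and regress the response on the component scores. The sketch below assumes, like the fitting functions above, that the response is in the first column; FitWithPCR is a hypothetical helper, not part of the current program.

```r
# Minimal principal component regression sketch (illustrative only).
# training.set: data frame with the response in the first column.
FitWithPCR <- function(training.set, n.components = 5) {
  X <- scale(training.set[, -1])   # standardize the predictors
  y <- training.set[, 1]           # response variable
  pca <- prcomp(X, center = FALSE, scale. = FALSE)
  scores <- pca$x[, 1:n.components, drop = FALSE]
  lm(y ~ scores)                   # regress on the leading component scores
}
```

A fuller implementation, with cross-validated selection of the number of components, is available in the pcr function of the pls package.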

Acknowledgments

I am grateful to the POSSE project manager Mikko Rauhala (mikko.rauhala@fmi.fi) for allowing me to present the work I have been doing in the project as my final assignment for the IODS course. I am grateful to Jussi Ylhäisi for giving the problem of finding the interdependence of variables, providing me with the data on Kumpula weather station and for the everyday discussions on weather data and modelling. My deep gratitude goes to the IODS course instructors, my fellow students whose assignments I have peer reviewed and who have reviewed my course work. They have taught me how to understand a data science problem, comprehend the variables in the problem and analyze the problem deeply.